<wseltzer> blassey: PEARG is hosting an interim meeting on IP privacy
<blassey> https://mailarchive.ietf.org/arch/msg/pearg/Ok7mxbJn6cZ0lvHVKC3RMSxnxXc/
<scribe> Scribe: Karen
Wendy: anyone who would like to introduce themselves
<wseltzer> mikko: NextRoll, Measurement
Niko from NextRoll
Andrew from IAB Europe
Joey from NextRoll
Wendy: we have 76 people on the
call here
... please "present+" yourself on irc to help us keep a
record
Wendy: Andrew, are you ready to finish discussion of MURRE proposal
<wseltzer> https://github.com/AdRoll/privacy/blob/main/MURRE.md
Andrew: If you haven't read the MURRE spec, I will do a quick overview
<GarrettJohnson> Wendy, can you announce this event? Public W3C event on AdTech with Don Marti, Wendy Seltzer, Aram Zucker-Scharff & Robin Berjon on Thursday morning 10am ET (also webstreamed live): https://www.eventbrite.com/e/webinnovationx-back-to-the-future-2020-fall-tickets-122613667781
Andrew: what it is, we wanted to
tackle problem of how to do machine learning (ML)
... in this world
... do some aggregated reports, but have not seen concrete
proposals
... at TPAC, usually ML requires granular data sets
... just a starting point
... I lay out some of weaknesses in the document as well
... let's talk about what it does
... takes a locally differentially private approach
... it is a relatively simple mechanism
... first thing browser needs to do, for each DSP, would
maintain a state that we internally at NextRoll call a
trail
... things get pushed into this trail; list of JSON
objects
... pixel events, impression, click, conversion events
... get pushed into this timeline object
... the proposal doesn't go into ton of detail of all the
objects
... need time stamp, what type of event such as impression or
click
... we also use an event ID
... might seem scary from privacy standpoint, but doesn't
leave the browser
... there to know what event we are talking about to run
computation over this trail
... then a data object
... blob could contain IG, products, what web site click
happened on
... browser is collecting these trails
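A minimal sketch of what such a trail could look like, assuming a plain in-memory structure. The class and field names (`TrailEvent`, `Trail`, `event_id`, `event_type`, `data`) are hypothetical and are not taken from the MURRE document; only the four pieces of information (timestamp, event type, event ID, data blob) come from the description above.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class TrailEvent:
    """One entry in a trail. Field names are assumptions for illustration."""
    event_id: str          # identifies the event; never leaves the browser
    timestamp: float       # seconds since the Unix epoch
    event_type: str        # e.g. "impression", "click", "conversion"
    data: Dict[str, Any] = field(default_factory=dict)  # free-form blob


@dataclass
class Trail:
    """Per-DSP list of events kept in browser storage."""
    dsp: str
    events: List[TrailEvent] = field(default_factory=list)

    def push(self, event: TrailEvent) -> None:
        self.events.append(event)

    def find(self, event_id: str) -> Optional[TrailEvent]:
        for event in self.events:
            if event.event_id == event_id:
                return event
        return None


# Hypothetical usage: an impression followed by a click on it.
trail = Trail(dsp="dsp.example")
trail.push(TrailEvent("imp-1", 1605600000.0, "impression", {"campaign": "c1"}))
trail.push(TrailEvent("clk-1", 1605600042.0, "click", {"impression_id": "imp-1"}))
```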
<wseltzer> https://github.com/AdRoll/privacy/blob/main/MURRE.md#murres-tech
Andrew: we also think this trail
notion could be useful for other things like privacy and
reporting
... you will hear more on that later
... just talking about MURRE now
... browser accumulates these trails
... for storage in browser; not supposed to escape browser in
any way
... We say a DSP can provide JS that can run over these
trails
... when browser invokes that JS, perform an extraction
... takes in trail and event ID
... when that gets triggered
... maybe on every impression
... probably conversion event
... I go through this in the document
... open to talk about more triggers
... but you don't want to trigger after all these events
... browser would have delay to avoid timing attacks
... running immediately after, like after showing ad
... doesn't provide enough time for user to click on ad
... might give yourself a two hour delay
... give full timeline
... event ID of impression being reported on
... find that event and see if it had a click, and provide a
label
scribe: at this point we can talk
about job of extractor
... it wants to return a set of arbitrary strings
... in the example I provide
... in the document
... you will see things for this browser, this advertiser saw
this many impressions on this campaign
... very detailed information
... could be randomized strings
... something people cannot just read
... and figure out what exactly the feature is
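The extractor step described above could be sketched roughly as follows. The proposal has the DSP supply JavaScript; Python is used here only for illustration. The event layout, the `impression_id` link between a click and its impression, and the feature string format are all assumptions made for this sketch.

```python
from typing import Any, Dict, List, Set, Tuple

Event = Dict[str, Any]


def extract(trail: List[Event], event_id: str) -> Tuple[Set[str], int]:
    """Return (feature strings, label) for the impression with this event ID.

    Hypothetical extractor: the label is 1 if the trail contains a click
    that points back at the impression, otherwise 0.
    """
    impression = next(e for e in trail if e["event_id"] == event_id)
    campaign = impression["data"].get("campaign", "unknown")

    # Frequency feature: impressions for this campaign up to and including this one.
    seen = sum(
        1
        for e in trail
        if e["type"] == "impression"
        and e["data"].get("campaign") == campaign
        and e["timestamp"] <= impression["timestamp"]
    )
    clicked = any(
        e["type"] == "click" and e["data"].get("impression_id") == event_id
        for e in trail
    )
    features = {
        f"campaign={campaign}",
        f"campaign={campaign}|impressions={seen}",
    }
    return features, int(clicked)


trail = [
    {"event_id": "i1", "timestamp": 100.0, "type": "impression", "data": {"campaign": "c1"}},
    {"event_id": "i2", "timestamp": 200.0, "type": "impression", "data": {"campaign": "c1"}},
    {"event_id": "k1", "timestamp": 230.0, "type": "click", "data": {"impression_id": "i2"}},
]
features, label = extract(trail, "i2")
```

A real extractor could return opaque or randomized strings instead of readable ones, as noted in the discussion; readable strings are used here only to keep the sketch easy to follow.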
<wseltzer> https://github.com/AdRoll/privacy/blob/main/MURRE.md#feature-extraction
scribe: it can be
obfuscated
... browser is responsible for further obfuscation of
strings
... DSP also provides a dimension, we call M
... essentially, these strings get hashed by a well known hash
function
... can only be up to this size M
<wseltzer> https://github.com/AdRoll/privacy/blob/main/MURRE.md#the-browser-computes-z
scribe: this set of strings is
converted into a set of integers
... size of M has to do with math
... dimensions in which @ exists in ML data set
... browser adds random integers into the set
... how many depends upon the math and probabilities set
up
... by hashing things
... you can have a large set of possible strings that might go
in
... but you can have hash collisions
... this provides some privacy protection
... some anonymity
... cryptographic hash function
... add these integers to add plausible deniability that we
see from differential privacy mechanisms
... browser reports back to DSP
... we don't want browser calling DSP directly
... we propose another 3rd party trusted service to receive the
payload and forward on to DSP
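The hashing and noise steps that produce this payload might look like the sketch below. It is not the exact procedure from the MURRE document: SHA-256 as the "well known hash function", the parameter names `m_bits` and `num_noise`, and drawing the noise integers uniformly are assumptions made for illustration.

```python
import hashlib
import secrets
from typing import Iterable, List


def hash_feature(feature: str, m_bits: int) -> int:
    """Map a feature string to an integer in [0, 2**m_bits) with SHA-256."""
    digest = hashlib.sha256(feature.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % (1 << m_bits)


def browser_report(features: Iterable[str], m_bits: int, num_noise: int) -> List[int]:
    """Hash the real features, then mix in random integers as noise.

    The result is sorted so the order does not reveal which integers
    are real and which are noise.
    """
    real = {hash_feature(f, m_bits) for f in features}
    noise = {secrets.randbelow(1 << m_bits) for _ in range(num_noise)}
    return sorted(real | noise)


# Hypothetical usage: 2**23 possible integers, 256 noise integers.
report = browser_report({"campaign=c1", "campaign=c1|impressions=2"}, 23, 256)
```

Because distinct strings can collide in the hash space and noise integers are indistinguishable from real ones, a reader of `report` cannot be sure that any one integer corresponds to a real feature. How much protection that gives depends on the parameters, which is the subject of the discussion that follows.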
<wseltzer> https://github.com/AdRoll/privacy/blob/main/MURRE.md#reporting-through-a-proxy
scribe: can be encrypted so proxy
cannot read it
... and DSP can decrypt on their side
... no user ID provided
... browser doesn't call...feature vector
... just get list of integers that represent features
... do what you want with that data set
... run ML algorithms and produce generalized model
... what we did here
... next part is how you do the inference
... once you have the model, how do you actually use it
... if hash function is known and standardized
... if DSP...three types of data
... what's on adv site, contextually, and what browser
knows
... on adv site, DSP takes what it sees about user and hash
those features
... those contextual features can be hashed in same way
... to provide another partial inference for what prediction
might be
<wseltzer> https://github.com/AdRoll/privacy/blob/main/MURRE.md#inference
scribe: there is another object
the DSP passes back that contains private data
... and you can encode what the prediction is
... finally there are the browser signals; this is the
trickiest one
... these things cannot leave the browser; frequency
metrics
<wseltzer> ... all the models are still sitting server-side
scribe: this browser has seen
this many ads
... if DSP knows the strings it would produce for a given
campaign and set of frequencies
... and can produce those strings offline, hash them and look
up into model
... small look up table
... that lookup table can be written in browser
... like ad writing object
... private data object
... bid function receives all these packages
... and can all be combined at bid time
... that is really the description of the mechanism
... about computing sets of numbers, adding add'l random
numbers and reporting it back
... may be more complicated on inference side
... there is the mathematics of it
<wseltzer> https://github.com/AdRoll/privacy/blob/main/MURRE.md#murres-mathematics
scribe: what happens under hood
is differential privacy mechanism repeated a bunch of
times
... we can compare MURRE mechanisms against others to see which
give better differential privacy
... I did a literature review on doing sparse learning
... for local differential privacy mechanisms; ML
capacity
... research is a little bit of a bummer
... not a lot in academia that suggests anything feasible at a
very high level of privacy
... reading through this stuff; papers suggest stuff that won't
work for adtech like sending gigabytes over the wire
... I think we kind of need to have a big discussion what our
epsilon could possibly be
... a lot of these vectors are sparse; lots of zeros; few
ones
... not criticizing the framework
<bleparmentier> I am criticizing it me!
scribe: differential privacy will treat the zeros as the ones
<bleparmentier> :)
<bleparmentier> (the framework is adapted)
<bleparmentier> Not adapted
scribe: reporting on something
that did not happen is considered privacy violating
... might have to expect larger epsilons...in other pieces of
the literature
... two parameters, dimension of hash space, and add new random
number
... have a direct effect on epsilon
... we ran tests on ML performance
... things were not as bad as expecting
... used a couple models, linear, quadratic using factorization
machines
... we could look at 10% loss in AUC
... model dimension is low
... things are more sensitive when we add a bunch of
noise
... we add 256 new features
... that could have been 256 advertisers, web sites
... we think that provides enough plausible deniability for the
browser
... difficult to look things up
... is in practice potentially doable, but it is
expensive
... to find one person; needle in haystack
... more consequences for quadratic model
... for linear model...could get some privacy guarantees
... with types of models we build out of a system like
this
... that is it
... what I wanted to run through
Wendy: Thanks, Andrew. You have a
queue for questions
... start by saying I have put links into irc for the
explainer
<wseltzer> https://github.com/AdRoll/privacy/blob/main/MURRE.md#introduction
Wendy: folks familiar with
different pieces of tech or math, can look back
... at the high level introduction and how this fits into our
work
... thanks for that intro
Bleparmentier: ask a
question
... what you are proposing is to hash the input vector
... for display
... and noise for using mechanism
... all the bits
... of this vector
... label and all dimensions
... publisher name, position
... do they respond well?
Andrew: We do not perform mechanism on the labels
Ble: would be exact?
Andrew: labels would be
exact
... labels are...
... did not go into that as much
... labels would also be integers; come in the clear
... browsers enforce that integers don't go above a certain
value
... could set whatever we want
... DSP constructs and reads what it wants
Bleparmentier: each display would
have one row in the data set
... and the label would be one of 16
... thank you for this work
... anything as you said, even if you don't want all the
framework
... I think what you are trying to do
... if we want to use differential privacy
... something like what you propose we should look into
... what you quote in your paper
... we propose another mechanism
... but thank you for your work
... if we want to go to differential privacy, this is
useful
... we should still keep option of @
... this could be hard
Andrew: thank you
Kleber: thank you for the
description and presentation
... I am worried about the privacy side of this
... I should disclose my background
... I was a math professor before going into computers
... I am worried about your math proposal for noise
... introducing...in this large feature vector
... afraid the field of error correcting code
... is 70 year long demonstration
... Shannon's theorem says it doesn't matter how much noise you
add
... someone clever choosing the original vector
... doesn't matter how much noise
... you add; always possible to recover what the
information was
... that you tried to put through this very noisy channel
... section of your description
... weaknesses section
... MURRE assumes independence in all dimensions
... some will be correlated
... and error correcting code
... is way to make these bits correlate
... and defeats...
... a big privacy problem as a result
[missed]
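The general idea behind the error-correcting-code concern can be illustrated with the simplest such code, a repetition code. This is a generic textbook sketch, not the specific attack discussed in the meeting; the function names and the bit-flip noise model are assumptions.

```python
import random
from typing import List


def encode(bit: int, n: int) -> List[int]:
    """Repetition code: send the same bit n times."""
    return [bit] * n


def noisy_channel(bits: List[int], flip_prob: float, rng: random.Random) -> List[int]:
    """Flip each bit independently with probability flip_prob."""
    return [b ^ int(rng.random() < flip_prob) for b in bits]


def decode(bits: List[int]) -> int:
    """Majority vote over the received bits."""
    return int(2 * sum(bits) > len(bits))


# Hypothetical usage: one bit sent 101 times through a channel
# that flips 30% of the bits, then recovered by majority vote.
rng = random.Random()
received = noisy_channel(encode(1, 101), 0.3, rng)
recovered = decode(received)
```

As long as each copy is flipped with probability below one half, adding more copies makes the majority vote more reliable. That is why noise applied independently to each dimension offers little protection when the sender can choose correlated dimensions.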
Andrew: core of discussion around
epsilon is in the proposal
... differential privacy is not necessarily
... how to go for this
... yes, I believe it is technically possible
... to construct some type of error correcting code that this
user was on this site at this time and so on
... believe diff privacy not trying to prevent that
... this notion of do I have this plausible deniability of what
my activities are
... at end of day, you are trying to make it difficult to
figure out who is doing what when
... vectors...would be expensive
... doing at scale
... if picking a well known hash function
... a DSP JS on the fly
... going to encode enough into to generate hashes...would be
such a giant piece of JS
... too big to download for dimensions
... point is well taken
... I don't believe as adtech ecosystem
... we can do what we need to do from ML standpoint
... when it seems like the goal is to get epsilon as close to
zero as possible
... that just trashes all of the information
... I am viewing this as a starting point; find improvements
over it
... I wanted to throw something out that would have a
measurable level of privacy
Kleber: that makes sense
... epsilon of scheme you describe is infinity
... but I don't think it imparts any additional privacy
Andrew: I talk about that as well
with hash function
... diff privacy...hash function
... doesn't
... trivial example
... diff privacy takes worst case scenario
... take something unique to a particular value
... giant set of things
... a unique output in that space, then you end up with epsilon
of infinity
... in practice I don't think that is a problem
Mehul: I think you are mixing up
error code and differential privacy
... not sure claim is if you can construct meaningful
vector
... problem of producing vector with or without user in
set
... doesn't mean you can do vector without @ code
... a person observing vector that user was ...intersect
... we should not mix up whether vector can be
constructed
... we should project [missed]....
Jonasz: thanks, Wendy
... wanted to quickly note, at RTB House, how to train models
in TD is one of most important open questions
... we don't have comments for MURRE yet
... we will pay close attention
... say thank you Andrew for starting this discussion
Charlie, Google: thanks for publishing this proposal
scribe: I thought it was really
interesting
... share some potential research directions
... not sure if you are aware of these things
... first one you mentioned
... feature vectors end up with lots of zeros and few
ones
... so very sparse
... if you could prove to browser that some states are less
sensitive
... there is research on this spectrum of differential
privacy
... don't have to provide less
... data...one sided DP
<wseltzer> "one-sided DP"
<charlieharrison> https://arxiv.org/pdf/1811.12469.pdf
scribe: took the binary that some
data not sensitive at all
... context aware DP
... I see it as kind of a spectrum
... can say this is much more or much less private
... not sure where this...
... cannot rely on the fact that a zero is less sensitive than
a one
... maybe changes to @
... other one to mention
<charlieharrison> https://arxiv.org/pdf/1811.12469.pdf
scribe: this technique called
privacy amplification by shuffling
... link a paper
... in the chat
<charlieharrison> https://arxiv.org/abs/1911.00038
scribe: here is the context
aware
... one I just linked
... privacy amplification
... if you have a bunch of local DP objects
... a bunch of these feature vectors with a local DP
... if you can route through shuffling proxy or some type of
service that permutes them
... you can show amplification of privacy through that
shuffling mechanism
... can shuffle encrypted reports
... and prove that you got privacy for free
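A shuffling proxy of the kind described could be sketched as below, assuming it only ever sees opaque encrypted reports. The class name, the fixed `batch_size` release rule, and the use of `secrets.SystemRandom` are assumptions for illustration; the privacy amplification result itself comes from the linked papers, not from this code.

```python
import secrets
from typing import List, Optional


class ShufflingProxy:
    """Collects encrypted reports and releases them in random order.

    Hypothetical sketch: reports are held until a full batch has arrived,
    so arrival order and timing cannot be tied to a single browser.
    """

    def __init__(self, batch_size: int) -> None:
        self.batch_size = batch_size
        self._pending: List[bytes] = []
        self._rng = secrets.SystemRandom()

    def submit(self, encrypted_report: bytes) -> Optional[List[bytes]]:
        """Queue a report; return a shuffled batch once enough have arrived."""
        self._pending.append(encrypted_report)
        if len(self._pending) < self.batch_size:
            return None
        batch, self._pending = self._pending, []
        self._rng.shuffle(batch)
        return batch
```

The proxy permutes whole reports from different browsers; it does not permute the integers inside a report, so the feature positions a model relies on stay intact.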
<AramZS> -
Andrew: I would need to read
papers; concerned about some kind of permutation
happening
... at end of day, you still need to form inference
... whatever you are trying to compute on inference side, has
to match up with training phase
... if you are shuffling things around in a random way...if
shuffling not too random, maybe that could work
@: Just shuffling vectors amongst themselves?
Andrew: [missed]
... information contains info where zero and ones come in
Charlie: it's taking the whole
vector itself and shuffling different vectors from different
browsers
... that is what mechanism would be doing
Andrew: at NextRoll we use SGD
where
... you have to shuffle anyway; trying to do a random
descent
... dangerous for optimizers; order doesn't matter
Charlie: that is case with local
DP
... suggest seeing if shuffling technique works
... show increased privacy balance
... those are the only two things to say
... but want to echo Michael's point
... it doesn't seem as hard to pull off the attacks as you
suggest; there might be something here
Andrew: I am happy to chat more
about what Michael brought up
... not supposed to apply DP to hash functions
... not a randomized mechanism; will result in epsilon
infinity
... I did abuse it
... did call it out; not that big an issue for us
Joel from OpenX: This seems like an interesting tool
scribe: how to bootstrap models
in a privacy preserving way
... could be used for DSPs, brands...
... as a browser user, concern that anyone with page access
could store tracking events
... want to view what are the trails, who can set them
... sounds like a complicated UI
scribe: how to do it simply
Andrew: could have some
interesting explorers
... show me set of events for a particular advertiser
... how many ads have I seen; have I bought anything; what
conversion exists
... trail mechanism is most useful thing in proposal
... enables these ML data sets, enable better APIs
... also add interfaces on top to give users more
... like a swath to have better control
... and see what the browser knows about them
Wendy: I am on the queue
... to say thank you
... at a high level
... this proposal thinks about the ecosystem and how various
components might fit together
... I encourage others to look at the set of components we have
been building
... how will this serve your use cases; how does it make data
come out useful
... where else do we need to analyze the privacy and utility of
components of each of use cases you bring forward
bleparmentier: back again
... I just want to say we have been one of the supporters of
use case of optimization
... inference...important use case
... requires granular data
... with MURRE...use cases that need
Basile: report would be used for
inference, optimization use cases
... other use cases might benefit from more granular or more
precise data set
... one of them being fraud
... or bug detection
... doing a lot at Criteo
<dialtone> fraud detection
Basile: bug on correction
... hard to find such bugs
... without granular data
... in these cases, important that it is still @
... if you want to find the bug you can learn
... it is associated with the bug
... but not useful if you don't know what the hash is
... cases that require both granularity
... and understandability
... we try to keep data clear if there is this capability
... maybe we will need different report for each use case
... less reports the better
... if it's too risky for privacy
... let's not forget a lot of use cases that need granular
understandable data
Andrew: I agree we still need mechanisms for bug detection; agree this is not for that
<bleparmentier> btw the reporting in SPARROW proposal: https://github.com/WICG/sparrow/blob/master/Reporting_in_SPARROW.md
Erik: thanks for the proposal
<bleparmentier> https://github.com/WICG/sparrow/blob/master/Reporting_in_SPARROW.md
Erik: I enjoyed reading it
... one suggestion
... one place for more research is to look at embedding
vectors
... we have thought about it in a couple other instances
... you might be able to get more utility for same privacy
budget
... and have a better chance to improve privacy...browser runs
that function
... unless you have a stronger proof of differential
privacy
Andrew: I come back to inference
side; what DSP has to know to do embedding themselves for data
they have at inference time
... and can that be cleanly split up
... the three types of data: adv, contextual and browser
data
... kind of...not embedding obfuscation
... browser data is particularly susceptible to this
... DSP never gets access
... somehow needs to output something that allows a look up in
browser to extract what it needs for frequency...
... can be complicated if you have something beyond a simple
lookup
... that is my initial reply
bmay: I was also thinking about
trails aspect of this
... and similar sorts of data collection on browser
... regarding cookies
... have browser define what can be read and written through
the cookies, a JSON based API
... changes on server side but not a lot of changes on the web
side
... things mentioned here fit that sort of model
Andrew: is that transitional
mechanism?
... operating under assumption that with third party cookies,
that I can see another web site is privacy violating
bmay: I would have to think about that
Remi: thanks for this insightful
proposal
... I have a few questions
<jrosewell> +1 to bmay comment - a great question
Remi: to give a metric to measure
privacy
... and to try to ....
... question about the trails between M
... and hashing of the @ a P
... seems like
... divide by two
... and 27 bits
... is much quicker than P by 2
scribe: also think it's a bit
intuitive
... that hash function is ok
... given load
... like 23
... as soon as you have this noise
... model will be @
... means actually optimal tradeoff would be a small M
... lower number of noise
... not adding 200 ....each time
... did I understand correctly?
Andrew: there will be this
tradeoff between M and P
... I am more concerned with technical limitations
... M being 23 is two to the 23
... if you have a low P, adding a lot of noise, shipping bits
over the wire
... set P to 0
... and send a completely dense @ vector
... half of integers in that set...over the wire
... we did not look at the interplay between M and P so
much
... there were a lot of experiments to run
... we may want to look at
... they play together in a way that is not entirely obvious
what is the degradation on the model
... may make sense to lower M
Remi: lowering M
provides...rather than adding noise to P
... wondering about experiments
... whether degradation...would be....
... open question...is it possible to have common data sets,
delta of @...and degradation for privacy proposal
Andrew: we see standard data sets all the time to figure out performance
Remi: interesting to
collaborate
... see if it is possible
Wendy: Thanks
... we are a bit over time
... put the question of data sets to another discussion
... very quickly, want to note
<wseltzer> zakim take up agendum 6
Wendy: the W3C NY Metro Chapter
event this week will feature some participants in this
group
... link in the minutes
Wendy: if you cannot make a meeting next week, give me a heads up or a -1 in irc channel
<joelmeyer> -1
Wendy: otherwise, assuming a
global community, we have plenty of people interested in
meeting next week
... and we will continue to gather agenda items
... should have enough to meet on the 24th of November
... thank you and see you later
... event is on Thursday, 19 Nov.
<Mikjuo> -1
<wseltzer> [adjourned]
Present: wseltzer mlerra bmay lbasdevant cpn apascoe_ AramZS bmilekic kris_chapman mserrate bleparmentier pl_mrcy Mikjuo arnaud_blanchard Karen blassey imeyers dkwestbr hcai kleber xaxisx Jukka ionel charlieharrison eriktaubeneck mjv Mike_Pisula_Xaxis joelmeyer br-rtbhouse wbaker jonasz shigeki GarrettJohnson jrosewell Dinesh-PubMatic
Agenda: https://lists.w3.org/Archives/Public/public-web-adv/2020Nov/0008.html