Improving Web Advertising BG -- 17 Nov 2020

<wseltzer> blassey: PEARG is hosting an interim meeting on IP privacy

<blassey> https://mailarchive.ietf.org/arch/msg/pearg/Ok7mxbJn6cZ0lvHVKC3RMSxnxXc/

<scribe> Scribe: Karen

Wendy: anyone who would like to introduce themselves

<wseltzer> mikko: NextRoll, Measurement

Niko from NextRoll

Andrew from IAB Europe

Joey from NextRoll

Wendy: we have 76 people on the call here
... please "present+" yourself on irc to help us keep a record

MURRE, Mechanism for User Reports with Regulated Epsilon, [from

Wendy: Andrew, are you ready to finish discussion of MURRE proposal

<wseltzer> https://github.com/AdRoll/privacy/blob/main/MURRE.md

Andrew: If you haven't read the MURRE spec, I will do a quick overview

<GarrettJohnson> Wendy, can you announce this event? Public W3C event on AdTech with Don Marti, Wendy Seltzer, Aram Zucker-Scharff & Robin Berjon on Thursday morning 10am ET (also webstreamed live): https://www.eventbrite.com/e/webinnovationx-back-to-the-future-2020-fall-tickets-122613667781

Andrew: what it is, we wanted to tackle problem of how to do machine learning (ML)
... in this world
... do some aggregated reports, but have not seen concrete proposals
... at TPAC, usually ML requires granular data sets
... just a starting point
... I lay out some of weaknesses in the document as well
... let's talk about what it does
... takes a locally differentially private approach
... it is a relatively simple mechanism
... first thing browser needs to do, for each DSP, would maintain a state that we internally at NextRoll call a trail
... things get pushed into this trail; list of JSON objects
... pixel events, impression, click, conversion events
... get pushed into this timeline object
... the proposal doesn't go into ton of detail of all the objects
... need time stamp, what type of event such as impression or click
... we also use an event ID
... might seem scary from privacy standpoint, but doesn't leave the browser
... there to know what event we are talking about to run computation over this trail
... then a data object
... blob could contain IG, products, what web site click happened on
... browser is collecting these trails
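
A minimal sketch of the trail described above, assuming one trail per DSP holding events with a timestamp, an event type, an event ID and a DSP-defined data blob. The class and field names (`Trail`, `TrailEvent`, `push`, `find`) are hypothetical and do not come from the MURRE document.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class TrailEvent:
    timestamp: float   # when the event happened (seconds since epoch)
    event_type: str    # e.g. "impression", "click", "conversion"
    event_id: str      # local identifier; assumed never to leave the browser
    data: Dict[str, Any] = field(default_factory=dict)  # DSP-defined blob


@dataclass
class Trail:
    dsp: str  # the DSP this trail is kept for
    events: List[TrailEvent] = field(default_factory=list)

    def push(self, event: TrailEvent) -> None:
        """Append an event to the per-DSP timeline."""
        self.events.append(event)

    def find(self, event_id: str) -> Optional[TrailEvent]:
        """Return the event with this ID, or None if it is not in the trail."""
        return next((e for e in self.events if e.event_id == event_id), None)
```

The event ID here is only a handle for later computation over the trail, which matches the point that it exists to identify which event is being discussed.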

<wseltzer> https://github.com/AdRoll/privacy/blob/main/MURRE.md#murres-tech

Andrew: we also think this trail notion could be useful for other things like privacy and reporting
... you will hear more on that later
... just talking about MURRE now
... browser accumulates these trails
... for storage in browser; not supposed to escape browser in any way
... We say a DSP can provide JS that can run over these trails
... when browser invokes that JS, it performs an extraction
... takes in trail and event ID
... when that gets triggered
... maybe on every impression
... probably conversion event
... I go through this in the document
... open to talk about more triggers
... but you don't want to trigger after all these events
... browser would have delay to avoid timing attacks
... running immediately after, like after showing ad
... doesn't provide enough time for user to click on ad
... might give yourself a two hour delay
... give full timeline
... event ID of impression being reported on
... find that event and see if it had a click, and provide a label

Andrew: at this point we can talk about job of extractor
... it wants to return a set of arbitrary strings
... in the example I provide
... in the document
... you will see things for this browser, this advertiser saw this many impressions on this campaign
... very detailed information
... could be randomized strings
... something people cannot just read
... and figure out what exactly the feature is
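
The extractor step above can be sketched as follows. The proposal has the DSP supply JavaScript; Python is used here only for illustration. The function name `extract`, the event dictionary layout, and the `impression_id` field linking a click to its impression are all assumptions, not part of the proposal.

```python
from typing import Any, Dict, List, Set, Tuple


def extract(trail: List[Dict[str, Any]], event_id: str) -> Tuple[Set[str], int]:
    """Hypothetical DSP extractor: returns (feature strings, label)."""
    # Locate the impression being reported on.
    target = next(e for e in trail if e["event_id"] == event_id)
    campaign = target["data"].get("campaign", "unknown")

    # Frequency feature: impressions seen for this campaign up to this event.
    seen = sum(
        1
        for e in trail
        if e["type"] == "impression"
        and e["data"].get("campaign") == campaign
        and e["timestamp"] <= target["timestamp"]
    )
    features = {
        f"campaign={campaign}",
        f"campaign={campaign}|impressions={seen}",
    }

    # Label: 1 if any click in the trail refers back to this impression.
    clicked = any(
        e["type"] == "click" and e["data"].get("impression_id") == event_id
        for e in trail
    )
    return features, int(clicked)
```

The returned strings are arbitrary from the browser's point of view, so a DSP could just as well return opaque or randomized strings in place of the readable ones used here.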

<wseltzer> https://github.com/AdRoll/privacy/blob/main/MURRE.md#feature-extraction

Andrew: it can be obfuscated
... browser is responsible for further obfuscation of strings
... DSP also provides a dimension, we call M
... essentially, these strings get hashed by a well known hash function
... can only be up to this size M

<wseltzer> https://github.com/AdRoll/privacy/blob/main/MURRE.md#the-browser-computes-z

Andrew: this set of strings is converted into a set of integers
... size of M has to do with math
... dimensions in which @ exists in ML data set
... browser adds random integers into the set
... how many depends upon the math and probabilities set up
... by hashing things
... you can have a large set of possible strings that might go in
... but you can have hash collisions
... this provides some privacy protection
... some anonymity
... cryptographic hash function
... add these integers to provide the plausible deniability that we see from differential privacy mechanisms
... browser reports back to DSP
... we don't want browser calling DSP directly
... we propose another 3rd party trusted service to receive the payload and forward on to DSP
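
A rough sketch of the hashing and noise step described above, under stated assumptions: SHA-256 stands in for the "well known hash function", the dimension is given as a number of bits, and a fixed count of uniformly random integers is added. In the proposal the amount of noise follows from its own math, which is not reproduced here. The names `hash_feature` and `noisy_report` are hypothetical.

```python
import hashlib
import secrets
from typing import Iterable, List


def hash_feature(feature: str, m_bits: int) -> int:
    """Map a feature string to an integer in [0, 2**m_bits)."""
    digest = hashlib.sha256(feature.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % (1 << m_bits)


def noisy_report(features: Iterable[str], m_bits: int, n_noise: int) -> List[int]:
    """Hash the features, then add n_noise extra random indices."""
    space = 1 << m_bits
    indices = {hash_feature(f, m_bits) for f in features}
    target_size = len(indices) + n_noise
    if target_size > space:
        raise ValueError("hash space too small for the requested noise")
    # Keep drawing until n_noise new indices have been added.
    while len(indices) < target_size:
        indices.add(secrets.randbelow(space))
    # Sorting hides the order in which real and random indices were added.
    return sorted(indices)
```

With the figures mentioned later in the discussion, a call might look like `noisy_report(features, 23, 256)`: a space of two to the 23 with 256 added integers.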

<wseltzer> https://github.com/AdRoll/privacy/blob/main/MURRE.md#reporting-through-a-proxy

Andrew: can be encrypted so proxy cannot read it
... and DSP can decrypt on their side
... no user ID provided
... browser doesn't call...feature vector
... just get list of integers that represent features
... do what you want with that data set
... run ML algorithms and produce generalized model
... what we did here
... next part is how you do the inference
... once you have the model, how do you actually use it
... if hash function is known and standardized
... if DSP...three types of data
... what's on adv site, contextually, and what browser knows
... on adv site, DSP takes what it sees about user and hash those features
... those contextual features can be hashed in same way
... to provide another partial inference for what prediction might be

<wseltzer> https://github.com/AdRoll/privacy/blob/main/MURRE.md#inference

Andrew: there is another object the DSP passes back that contains private data
... and you can encode what the prediction is
... finally there are the browser signals; this is the trickiest one
... these things cannot leave the browser; frequency metrics

<wseltzer> ... all the models are still sitting server-side

Andrew: this browser has seen this many ads
... if DSP knows the strings it would produce for a given campaign and set of frequencies
... and can produce those strings offline, hash them and look up into model
... small look up table
... that lookup table can be written in browser
... like ad writing object
... private data object
... bid function receives all these packages
... and can all be combined at bid time
... that is really the description of the mechanism
... about computing sets of numbers, adding add'l random numbers and reporting it back
... may be more complicated on inference side
... there is the mathematics of it
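
The inference flow above can be sketched as follows, assuming a linear model stored as a mapping from hashed index to weight, SHA-256 as the shared hash, and a logistic output. The function names and the feature string format are hypothetical. The DSP scores advertiser-site and contextual features server-side, precomputes a small table for the frequency strings it knows the extractor would produce, and the bid function combines the pieces in the browser.

```python
import hashlib
import math
from typing import Dict, Iterable


def hash_feature(feature: str, m_bits: int) -> int:
    """Same assumed hash as used when building the training reports."""
    digest = hashlib.sha256(feature.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % (1 << m_bits)


def partial_score(weights: Dict[int, float], features: Iterable[str], m_bits: int) -> float:
    """Server-side: sum the model weights of the hashed features."""
    return sum(weights.get(hash_feature(f, m_bits), 0.0) for f in features)


def frequency_table(weights: Dict[int, float], campaign: str,
                    max_freq: int, m_bits: int) -> Dict[int, float]:
    """Server-side: weight for each frequency string, computed offline."""
    return {
        n: weights.get(hash_feature(f"campaign={campaign}|impressions={n}", m_bits), 0.0)
        for n in range(max_freq + 1)
    }


def predict(advertiser_score: float, contextual_score: float,
            table: Dict[int, float], local_frequency: int) -> float:
    """Browser-side bid function: combine the partial scores at bid time."""
    z = advertiser_score + contextual_score + table.get(local_frequency, 0.0)
    return 1.0 / (1.0 + math.exp(-z))
```

Only the lookup table travels to the browser in this sketch; the frequency count itself is read locally and never sent back.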

<wseltzer> https://github.com/AdRoll/privacy/blob/main/MURRE.md#murres-mathematics

Andrew: what happens under hood is differential privacy mechanism repeated a bunch of times
... we can compare MURRE mechanisms against others to see which give better differential privacy
... I did a literature review on doing sparse learning
... for local differential privacy mechanisms; ML capacity
... research is a little bit of a bummer
... not a lot in academia that suggests anything feasible at a very high level of privacy
... reading through this stuff; papers suggest stuff that won't work for adtech like sending gigabytes over the wire
... I think we kind of need to have a big discussion what our epsilon could possibly be
... a lot of these vectors are sparse; lots of zeros; few ones
... not criticizing the framework

<bleparmentier> I am criticizing it me!

Andrew: differential privacy will treat the zeros as the ones

<bleparmentier> :)

<bleparmentier> (the framework is adapted)

<bleparmentier> Not adapted

Andrew: reporting on something that did not happen is considered privacy violating
... might have to expect larger epsilons...in other pieces of the literature
... two parameters, the dimension of the hash space and the number of random integers added
... have a direct effect on epsilon
... we ran tests on ML performance
... things were not as bad as expected
... used a couple models, linear, quadratic using factorization machines
... we could look at 10% loss at AUC
... model dimension is low
... things are more sensitive when we add a bunch of noise
... we add 256 new features
... that could have been 256 advertisers, web sites
... we think that provides enough plausible deniability for the browser
... difficult to look things up
... is in practice potentially doable, but it is expensive
... to find one person; needle in haystack
... more consequences for quadratic model
... for linear model...could get some privacy guarantees
... with types of models we build out of a system like this
... that is it
... what I wanted to run through

Wendy: Thanks, Andrew. You have a queue for questions
... start by saying I have put links into irc for the explainer

<wseltzer> https://github.com/AdRoll/privacy/blob/main/MURRE.md#introduction

Wendy: folks familiar with different pieces of tech or math, can look back
... at the high level introduction and how this fits into our work
... thanks for that intro

Bleparmentier: ask a question
... what you are proposing is to hash the input vector
... for display
... and noise for using mechanism
... all the bits
... of this vector
... label and all dimensions
... publisher name, position
... do they respond well?

Andrew: We do not perform mechanism on the labels

Bleparmentier: would be exact?

Andrew: labels would be exact
... labels are...
... did not go into that as much
... labels would also be integers; come in the clear
... browsers enforce that integers don't go above a certain value
... could set whatever we want
... DSP constructs and reads what it wants

Bleparmentier: each display would have one row in the data set
... and the label would be one of 16
... thank you for this work
... anything as you said, even if you don't want all the framework
... I think what you are trying to do
... if we want to use differential privacy
... something like what you propose we should look into
... what you quote in your paper
... we propose another mechanism
... but thank you for your work
... if we want to go to differential privacy, this is useful
... we should still keep option of @
... this could be hard

Andrew: thank you

Kleber: thank you for the description and presentation
... I am worried about the privacy side of this
... I should disclose my background
... I was a math professor before going into computers
... I am worried about your math proposal for noise
... introducing...in this large feature vector
... afraid the field of error correcting codes
... is a 70 year long demonstration
... Shannon's theorem says it doesn't matter how much noise you add
... someone clever choosing the original vector
... doesn't matter how much noise
... you add; always possible to recover what the information was
... that you tried to put through this very noisy channel
... section of your description
... weaknesses section
... MURRE assumes independence in all dimensions
... some will be correlated
... and error correcting code
... is way to make these bits correlate
... and defeats...
... a big privacy problem as a result

[missed]

Andrew: core of discussion around epsilon is in the proposal
... differential privacy is not necessarily
... how to go for this
... yes, I believe it is technically possible
... to construct some type of error correcting code that this user was on this site at this time and so on
... believe diff privacy not trying to prevent that
... this notion of do I have this plausible deniability of what my activities are
... at end of day, you are trying to make it difficult to figure out who is doing what when
... vectors...would be expensive
... doing at scale
... if picking a well known hash function
... a DSP JS on the fly
... going to encode enough info to generate hashes...would be such a giant piece of JS
... too big to download for dimensions
... point is well taken
... I don't believe as adtech ecosystem
... we can do what we need to do from ML standpoint
... when it seems like the goal is to get epsilon as close to zero as possible
... that just trashes all of the information
... I am viewing this as a starting point; find improvements over it
... I wanted to throw something out that would have a measurable level of privacy

Kleber: that makes sense
... epsilon of scheme you describe is infinity
... but I don't think it imparts any additional privacy

Andrew: I talk about that as well with hash function
... diff privacy...hash function
... doesn't
... trivial example
... diff privacy takes worst case scenario
... take something unique to a particular value
... giant set of things
... a unique output in that space, then you end up with epsilon of infinity
... in practice I don't think that is a problem

Mehul: I think you are mixing up error code and differential privacy
... not sure claim is if you can construct meaningful vector
... problem of producing vector with or without user in set
... doesn't mean you can do vector without @ code
... a person observing vector that user was ...intersect
... we should not mix up whether vector can be constructed
... we should project [missed]....

Jonasz: thanks, Wendy
... wanted to quickly note, at RTB House, how to train models in TD is one of most important open questions
... we don't have comments for MURRE yet
... we will pay close attention
... say thank you Andrew for starting this discussion

Charlie, Google: thanks for publishing this proposal

Charlie: I thought it was really interesting
... share some potential research directions
... not sure if you are aware of these things
... first one you mentioned
... feature vectors end up with lots of zeros and few ones
... so very sparse
... if you could prove to browser that some states are less sensitive
... there is research on this spectrum of differential privacy
... don't have to provide less
... data...one sided DP

<wseltzer> "one-sided DP"

<charlieharrison> https://arxiv.org/pdf/1811.12469.pdf

Charlie: took the binary that some data not sensitive at all
... context aware DP
... I see it as kind of a spectrum
... can say this is much more or much less private
... not sure where this...
... cannot rely on the fact that a zero is less sensitive than a one
... maybe changes to @
... other one to mention

<charlieharrison> https://arxiv.org/pdf/1811.12469.pdf

Charlie: this technique called privacy amplification by shuffling
... link a paper
... in the chat

<charlieharrison> https://arxiv.org/abs/1911.00038

Charlie: here is the context aware
... one I just linked
... privacy amplification
... if you have a bunch of local DP objects
... a bunch of these feature vectors with a local DP
... if you can route through shuffling proxy or some type of service that permutes them
... you can show amplification of privacy through that shuffling mechanism
... can shuffle encrypted reports
... and prove that you got privacy for free
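
A minimal sketch of the shuffling proxy idea described above, assuming the proxy receives whole encrypted reports as opaque bytes, waits for a batch, and forwards them in a random order so that arrival order no longer ties a report to a browser. The function name `shuffle_batch` and the minimum batch size are hypothetical. The amplification guarantee itself comes from the cited papers and is not computed here.

```python
import secrets
from typing import List


def shuffle_batch(reports: List[bytes], min_batch: int = 3) -> List[bytes]:
    """Permute whole reports among themselves; contents are left untouched."""
    if len(reports) < min_batch:
        # Too few reports to hide any single one among the others.
        raise ValueError("batch too small to shuffle")
    shuffled = list(reports)  # copy so the caller's list is not modified
    secrets.SystemRandom().shuffle(shuffled)
    return shuffled
```

Because each report is moved as a unit, the positions of the integers inside a report are unchanged, so the training side still sees the same feature vectors, only in a different order.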

<AramZS> -

Andrew: I would need to read papers; concerned about some kind of permutation happening
... at end of day, you still need to perform inference
... whatever you are trying to compute on inference side, has to match up with training phase
... if you are shuffling things around in a random way...if shuffling not too random, maybe that could work

@: Just shuffling vectors amongst themselves?

Andrew: [missed]
... information contains info where zero and ones come in

Charlie: it's taking the whole vector itself and shuffling different vectors from different browsers
... that is what mechanism would be doing

Andrew: at NextRoll we use SGD where
... you have to shuffle anyway; trying to do a random descent
... dangerous for optimizers; order doesn't matter

Charlie: that is case with local DP
... suggest seeing if shuffling technique works
... show increased privacy balance
... those are the only two things to say
... but want to echo Michael's point
... it doesn't seem as hard to pull off the attacks as you suggest; there might be something here

Andrew: I am happy to chat more about what Michael brought up
... not supposed to apply DP to hash functions
... not a randomized mechanism; will result in epsilon infinity
... I did abuse it
... did call it out; not that big an issue for us

Joel from OpenX: This seems like an interesting tool

Joel: how to bootstrap models in a privacy preserving way
... could be used for DSPs, brands...
... as a browser user, concern that anyone with page access could store tracking events
... want to view what are the trails, who can set them
... sounds like a complicated UI

Joel: how to do it simply

Andrew: could have some interesting explorers
... show me set of events for a particular advertiser
... how many ads have I seen; have I bought anything; what conversion exists
... trail mechanism is most useful thing in proposal
... enables these ML data sets, enable better APIs
... also add interfaces on top to give users more
... like a swath to have better control
... and see what the browser knows about them

Wendy: I am on the queue
... to say thank you
... at a high level
... this proposal thinks about the ecosystem and how various components might fit together
... I encourage others to look at the set of components we have been building
... how will this serve your use cases; how does it make data come out useful
... where else do we need to analyze the privacy and utility of components of each of use cases you bring forward

bleparmentier: back again
... I just want to say we have been one of the supporters of use case of optimization
... inference...important use case
... requires granular data
... with MURRE...use cases that need

Basile: report would be used for inference, optimization use cases
... other use cases might benefit from more granular or more precise data set
... one of them being fraud
... or bug detection
... doing a lot at Criteo

<dialtone> fraud detection

Basile: bug on correction
... hard to find such bugs
... without granular data
... in these cases, important that it is still @
... if you want to find the bug you can learn
... it is associated with the bug
... but not useful if you don't know what the hash is
... cases that require both granularity
... and understandability
... we try to keep data clear if there is this capability
... maybe we will need different report for each use case
... less reports the better
... if it's too risky for privacy
... let's not forget a lot of use cases that need granular understandable data

Andrew: I agree we still need mechanisms for bug detection; agree this is not for that

<bleparmentier> btw the reporting in SPARROW proposal https://github.com/WICG/sparrow/blob/master/Reporting_in_SPARROW.md

Erik: thanks for the proposal

<bleparmentier> https://github.com/WICG/sparrow/blob/master/Reporting_in_SPARROW.md

Erik: I enjoyed reading it
... one suggestion
... one place for more research is to look at embedding vectors
... we have thought about it in a couple other instances
... you might be able to get more utility for same privacy budget
... and have a better chance to improve privacy...browser runs that function
... unless you have a stronger proof of differential privacy

Andrew: I come back to inference side; what DSP has to know to do embedding themselves for data they have at inference time
... and can that be cleanly split up
... the three types of data: adv, contextual and browser data
... kind of...not embedding obfuscation
... browser data is particularly susceptible to this
... DSP never gets access
... somehow needs to output something that allows a look up in browser to extract what it needs for frequency...
... can be complicated if you have something beyond a simple lookup
... that is my initial reply

bmay: I was also thinking about trails aspect of this
... and similar sorts of data collection on browser
... regarding cookies
... have browser define what can be read and written through the cookies, a JSON based API
... changes on server side but not a lot of changes on the web side
... things mentioned here fit that sort of model

Andrew: is that transitional mechanism?
... operating under assumption that with third party cookies, that I can see another web site is privacy violating

bmay: I would have to think about that

Remi: thanks for this insightful proposal
... I have a few questions

<jrosewell> +1 to bmay comment - a great question

Remi: to give a metric to measure privacy
... and to try to ....
... question about the trails between M
... and hashing of the @ a P
... seems like
... divide by two
... and 27 bits

... is much quicker than P by 2

Remi: also think it's a bit intuitive
... that hash function is ok
... given load
... like 23
... as soon as you have this noise
... model will be @
... means actually optimal tradeoff would be a small M
... lower number of noise
... not adding 200 ....each time
... did I understand correctly?

Andrew: there will be this tradeoff between M and P
... I am more concerned with technical limitations
... M being 23 is two to the 23
... if you have a low P, adding a lot of noise, shipping bits over the wire
... set P to 0
... and send a completely dense @ vector
... half of integers in that set...over the wire
... we did not look at the interplay between M and P so much
... there were a lot of experiments to run
... we may want to look at
... they play together in a way that is not entirely obvious what is the degradation on the model
... may make sense to lower M

Remi: lowering M provides...rather than adding noise to P
... wondering about experiments
... whether degradation...would be....
... open question...is it possible to have common data sets, delta of @...and degradation for privacy proposal

Andrew: we see standard data sets all the time to figure out performance

Remi: interesting to collaborate
... see if it is possible

Wendy: Thanks
... we are a bit over time
... put the question of data sets to another discussion
... very quickly, want to note

<wseltzer> zakim take up agendum 6

Wendy: the W3C NY Metro Chapter event this week will feature some participants in this group
... link in the minutes

W3C chapter event, https://www.eventbrite.com/e/webinnovationx-back-to-the-future-2020-fall-tickets-122613667781

Wendy: if you cannot make a meeting next week, give me a heads up or a -1 in irc channel

<joelmeyer> -1

Wendy: otherwise, assuming a global community, we have plenty of people interested in meeting next week
... and we will continue to gather agenda items
... should have enough to meet on the 24th of November
... thank you and see you later
... event is on Thursday, 19 Nov.

<Mikjuo> -1

<wseltzer> [adjourned]
