[Script Info]
Title:

[Events]
Format: Layer, Start, End, Style, Name, MarginL, MarginR, MarginV, Effect, Text
Dialogue: 0,0:00:21.75,0:00:28.68,Default,,0000,0000,0000,,On behalf of the Department of Computer Science and Automation at IISc
Dialogue: 0,0:00:28.70,0:00:34.56,Default,,0000,0000,0000,,I would like to wish you all a very warm welcome to this first public lecture
Dialogue: 0,0:00:34.72,0:00:38.98,Default,,0000,0000,0000,,on Foundations of Data Science by Professor Ravi Kannan
Dialogue: 0,0:00:39.61,0:00:45.05,Default,,0000,0000,0000,,as all of you know, big data or data science
Dialogue: 0,0:00:45.59,0:00:49.40,Default,,0000,0000,0000,,represents the latest wave in the computing revolution
Dialogue: 0,0:00:50.55,0:00:55.56,Default,,0000,0000,0000,,now the volume, the complexity, and the variety of the data
Dialogue: 0,0:00:55.56,0:00:59.80,Default,,0000,0000,0000,,that is becoming available day after day
Dialogue: 0,0:01:00.22,0:01:03.44,Default,,0000,0000,0000,,are now at an unprecedented scale
Dialogue: 0,0:01:04.20,0:01:09.63,Default,,0000,0000,0000,,so if you are able to extract useful and actionable knowledge out of this data
Dialogue: 0,0:01:09.89,0:01:13.49,Default,,0000,0000,0000,,that's going to be of huge benefit
Dialogue: 0,0:01:14.22,0:01:17.73,Default,,0000,0000,0000,,so this initiative, which is called the big data initiative
Dialogue: 0,0:01:18.26,0:01:21.31,Default,,0000,0000,0000,,has just been launched in order to realize
Dialogue: 0,0:01:21.79,0:01:28.09,Default,,0000,0000,0000,,the potential of this very exciting, very fascinating, and very challenging area
Dialogue: 0,0:01:29.42,0:01:33.62,Default,,0000,0000,0000,,now the primary objective of the big data initiative
Dialogue: 0,0:01:34.48,0:01:39.38,Default,,0000,0000,0000,,is to build a very strong academic and research ecosystem
Dialogue: 0,0:01:40.73,0:01:46.38,Default,,0000,0000,0000,,that would enable India's excellence in this fast emerging area
Dialogue: 0,0:01:46.38,0:01:50.42,Default,,0000,0000,0000,,and take India to a leadership position in this area
Dialogue: 0,0:01:51.48,0:01:54.84,Default,,0000,0000,0000,,now the faculty of the Department of Computer Science and Automation
Dialogue: 0,0:01:54.84,0:01:59.56,Default,,0000,0000,0000,,and also the Indian Institute of Science have come together to launch this initiative
Dialogue: 0,0:02:00.18,0:02:05.55,Default,,0000,0000,0000,,and they have lined up a series of ambitious programs as a part of this initiative
Dialogue: 0,0:02:06.81,0:02:11.41,Default,,0000,0000,0000,,so the first of these initiatives is going to be the public lectures program
Dialogue: 0,0:02:11.79,0:02:15.86,Default,,0000,0000,0000,,and the primary objective of the public lectures program
Dialogue: 0,0:02:16.78,0:02:26.64,Default,,0000,0000,0000,,is to give an introduction, an exposure, to all the fascinating research that is going on in this emerging area
Dialogue: 0,0:02:27.14,0:02:31.87,Default,,0000,0000,0000,,and also to tell you more about the current state of the art in this area
Dialogue: 0,0:02:32.70,0:02:37.92,Default,,0000,0000,0000,,now to give you the first public lecture we have today with us Professor Ravi Kannan
Dialogue: 0,0:02:38.68,0:02:45.08,Default,,0000,0000,0000,,who has been a legendary researcher in the area of algorithms and theoretical computer science for a long time
Dialogue: 0,0:02:46.63,0:02:51.81,Default,,0000,0000,0000,,Professor Ravi Kannan is currently an adjunct faculty member of the Department of Computer Science and Automation
Dialogue: 0,0:02:53.21,0:02:56.83,Default,,0000,0000,0000,,and as a principal researcher at Microsoft Research India
Dialogue: 0,0:02:57.20,0:03:02.40,Default,,0000,0000,0000,,he is leading a very successful group on algorithms at MSR India
Dialogue: 0,0:03:03.72,0:03:11.78,Default,,0000,0000,0000,,and as all of you might know, he delivered the 2012 Professor I G Sarma memorial lecture
Dialogue: 0,0:03:11.78,0:03:14.78,Default,,0000,0000,0000,,in this same faculty hall
Dialogue: 0,0:03:14.78,0:03:21.72,Default,,0000,0000,0000,,and then in 2011 he was the recipient of the very prestigious
Dialogue: 0,0:03:21.72,0:03:26.33,Default,,0000,0000,0000,,Donald Knuth Prize for his outstanding contributions in designing
Dialogue: 0,0:03:26.33,0:03:30.65,Default,,0000,0000,0000,,creative algorithms for many computationally challenging problems
Dialogue: 0,0:03:30.65,0:03:36.99,Default,,0000,0000,0000,,and let's give Professor Ravi Kannan a wide applause for the Donald Knuth Prize\N[audience applause]
Dialogue: 0,0:03:38.76,0:03:44.59,Default,,0000,0000,0000,,and 20 years before that, way back in 1991
Dialogue: 0,0:03:44.59,0:03:48.35,Default,,0000,0000,0000,,he was the recipient of the prestigious Fulkerson Prize
Dialogue: 0,0:03:48.35,0:03:53.27,Default,,0000,0000,0000,,for his seminal work on estimating the volume of convex sets
Dialogue: 0,0:03:54.19,0:03:58.74,Default,,0000,0000,0000,,so here is a researcher who has made ground breaking contributions
Dialogue: 0,0:03:58.74,0:04:01.84,Default,,0000,0000,0000,,on a variety of topics in algorithms
Dialogue: 0,0:04:02.15,0:04:06.44,Default,,0000,0000,0000,,now he has designed approximation algorithms for integer programming
Dialogue: 0,0:04:07.08,0:04:10.13,Default,,0000,0000,0000,,and he has designed randomized algorithms for linear algebra
Dialogue: 0,0:04:10.70,0:04:14.74,Default,,0000,0000,0000,,and [unclear] learning algorithms for convex sets
Dialogue: 0,0:04:15.27,0:04:22.61,Default,,0000,0000,0000,,so he's truly, in my view, he is the Sunil Manohar Gavaskar of the big data team
Dialogue: 0,0:04:23.46,0:04:27.20,Default,,0000,0000,0000,,and with a wide applause, let's welcome Professor Ravi Kannan\N[audience applause]
Dialogue: 0,0:04:27.20,0:04:33.27,Default,,0000,0000,0000,,and invite him to deliver this first public lecture, Professor Ravi Kannan\N[audience applause]
Dialogue: 0,0:04:51.97,0:04:55.72,Default,,0000,0000,0000,,so thanks Professor Narahari for that gracious introduction
Dialogue: 0,0:04:55.72,0:05:01.24,Default,,0000,0000,0000,,we await [Sulah Kandalkar?] but I'll start with [Rick Elsore?]
Dialogue: 0,0:05:01.24,0:05:03.62,Default,,0000,0000,0000,,I want to also thank CSA, this is a very nice initiative
Dialogue: 0,0:05:03.62,0:05:06.41,Default,,0000,0000,0000,,CSA has always been very dynamic
Dialogue: 0,0:05:06.41,0:05:11.24,Default,,0000,0000,0000,,and sort of looking to the latest advances in various fields
Dialogue: 0,0:05:11.24,0:05:14.36,Default,,0000,0000,0000,,in addition to adding all the new exciting faculty
Dialogue: 0,0:05:14.36,0:05:16.95,Default,,0000,0000,0000,,I think it's a very good initiative to start on big data
Dialogue: 0,0:05:16.95,0:05:19.78,Default,,0000,0000,0000,,I should also recognize and thank Professor Shevade from CSA
Dialogue: 0,0:05:19.78,0:05:24.48,Default,,0000,0000,0000,,for having this idea and getting us all together, thank you Shevade\N[applause]
Dialogue: 0,0:05:29.73,0:05:34.72,Default,,0000,0000,0000,,so since this is the first lecture I should give you actually
Dialogue: 0,0:05:34.72,0:05:37.89,Default,,0000,0000,0000,,a sort of bird's eye view of the whole field in some sense
Dialogue: 0,0:05:37.89,0:05:39.64,Default,,0000,0000,0000,,I'm not very good at bird's eye views
Dialogue: 0,0:05:39.64,0:05:43.48,Default,,0000,0000,0000,,so I will give you a couple of slides which are very general
Dialogue: 0,0:05:43.48,0:05:46.67,Default,,0000,0000,0000,,and then zoom in on a piece of the action
Dialogue: 0,0:05:46.67,0:05:51.09,Default,,0000,0000,0000,,one slice of the field if you will, and sort of focus on that
Dialogue: 0,0:05:51.09,0:05:54.75,Default,,0000,0000,0000,,so the title of the talk is not big data, it's Foundations of Data Science
Dialogue: 0,0:05:54.75,0:05:57.09,Default,,0000,0000,0000,,which is sort of related because it is actually the title of a book
Dialogue: 0,0:05:57.09,0:05:58.97,Default,,0000,0000,0000,,I'm writing with John Hopcroft
Dialogue: 0,0:05:58.97,0:06:02.90,Default,,0000,0000,0000,,in the process we had to first think about the following
Dialogue: 0,0:06:02.90,0:06:08.45,Default,,0000,0000,0000,,what really, what would you call the foundations of data science?
Dialogue: 0,0:06:09.51,0:06:12.43,Default,,0000,0000,0000,,see the phrase "big data" connotes something very jazzy
Dialogue: 0,0:06:12.43,0:06:15.59,Default,,0000,0000,0000,,and so that can have two interpretations
Dialogue: 0,0:06:15.59,0:06:18.90,Default,,0000,0000,0000,,one is something cool or hot or whatever you call it
Dialogue: 0,0:06:18.90,0:06:21.74,Default,,0000,0000,0000,,with perhaps not so much substance
Dialogue: 0,0:06:21.74,0:06:24.03,Default,,0000,0000,0000,,one is something with a lot of substance
Dialogue: 0,0:06:24.03,0:06:26.76,Default,,0000,0000,0000,,I will hope to convince you that it's both
Dialogue: 0,0:06:26.76,0:06:28.96,Default,,0000,0000,0000,,it's hot but also has a lot of substance
Dialogue: 0,0:06:28.96,0:06:31.53,Default,,0000,0000,0000,,and so the phrase "foundations of data science"
Dialogue: 0,0:06:31.53,0:06:34.47,Default,,0000,0000,0000,,connotes perhaps the substance part of it
Dialogue: 0,0:06:34.47,0:06:37.45,Default,,0000,0000,0000,,but the question is what are, what should you consider
Dialogue: 0,0:06:37.45,0:06:40.02,Default,,0000,0000,0000,,to be part of the foundations of data science
Dialogue: 0,0:06:40.02,0:06:43.42,Default,,0000,0000,0000,,or what should you teach if you're going to teach a course on data science
Dialogue: 0,0:06:43.42,0:06:46.04,Default,,0000,0000,0000,,or foundations of data science, what should you teach
Dialogue: 0,0:06:46.04,0:06:50.72,Default,,0000,0000,0000,,so here's something that if you haven't been to a talk with this sort of title
Dialogue: 0,0:06:50.72,0:06:54.39,Default,,0000,0000,0000,,you won't have seen, but if you've been to such a talk
Dialogue: 0,0:06:54.39,0:06:58.66,Default,,0000,0000,0000,,almost certainly you've seen a curve like this, which I will now draw
Dialogue: 0,0:06:59.10,0:07:03.90,Default,,0000,0000,0000,,this is the exponential e^(cx) for some constant c
Dialogue: 0,0:07:05.56,0:07:07.96,Default,,0000,0000,0000,,so various forms of this curve
Dialogue: 0,0:07:07.96,0:07:11.19,Default,,0000,0000,0000,,this is the rate of growth of data if you will
Dialogue: 0,0:07:11.19,0:07:13.46,Default,,0000,0000,0000,,or the rate of growth of storage and speed and so on
Dialogue: 0,0:07:13.46,0:07:15.16,Default,,0000,0000,0000,,there are different constants
Dialogue: 0,0:07:15.16,0:07:19.82,Default,,0000,0000,0000,,now there is really a point, I'm not making a caricature out of this
Dialogue: 0,0:07:19.82,0:07:25.97,Default,,0000,0000,0000,,there really is a point in that it's not as if human knowledge has grown exponentially
Dialogue: 0,0:07:25.97,0:07:30.51,Default,,0000,0000,0000,,but it is true that our ability to gather data, to store it
Dialogue: 0,0:07:30.51,0:07:33.58,Default,,0000,0000,0000,,to annotate it has grown perhaps exponentially
Dialogue: 0,0:07:33.58,0:07:36.79,Default,,0000,0000,0000,,and in fact so we do have an exponential rate of growth of data
Dialogue: 0,0:07:36.79,0:07:38.42,Default,,0000,0000,0000,,but I'm not going to draw you this curve
Dialogue: 0,0:07:38.42,0:07:42.68,Default,,0000,0000,0000,,perhaps my suggestion is, for what I'm going to put in the list here
Dialogue: 0,0:07:42.68,0:07:46.06,Default,,0000,0000,0000,,the first few things, maybe the answer is no
Dialogue: 0,0:07:46.06,0:07:49.29,Default,,0000,0000,0000,,they are not what I would consider the foundations of data science
Dialogue: 0,0:07:49.29,0:07:51.42,Default,,0000,0000,0000,,but they are important nevertheless
Dialogue: 0,0:07:51.42,0:07:54.18,Default,,0000,0000,0000,,you could also ask how do you collect and store
Dialogue: 0,0:07:54.18,0:07:58.01,Default,,0000,0000,0000,,and annotate data in various domains
Dialogue: 0,0:07:58.01,0:08:02.74,Default,,0000,0000,0000,,in fact some of these answers will depend on the domain we dwell on
Dialogue: 0,0:08:02.74,0:08:07.23,Default,,0000,0000,0000,,be it astronomy or commerce. I put up three areas here in which we do have in fact
Dialogue: 0,0:08:07.23,0:08:12.41,Default,,0000,0000,0000,,really a big growth in the data that's been collected and stored
Dialogue: 0,0:08:12.41,0:08:18.81,Default,,0000,0000,0000,,but I would like to suggest that this is not, for us, foundations of data science at the moment
Dialogue: 0,0:08:18.81,0:08:22.31,Default,,0000,0000,0000,,partly because, while it's very important, it depends on the domain
Dialogue: 0,0:08:22.31,0:08:25.63,Default,,0000,0000,0000,,so it's very specific to each domain perhaps
Dialogue: 0,0:08:25.63,0:08:28.68,Default,,0000,0000,0000,,now you could also say basic statistics is important
Dialogue: 0,0:08:28.68,0:08:34.15,Default,,0000,0000,0000,,because if you have a lot of data, more than you can handle simply by traditional algorithms
Dialogue: 0,0:08:34.15,0:08:35.91,Default,,0000,0000,0000,,surely you must sample
Dialogue: 0,0:08:35.91,0:08:39.02,Default,,0000,0000,0000,,that is you must draw a random sample of the data and deal with that
Dialogue: 0,0:08:39.02,0:08:43.81,Default,,0000,0000,0000,,so sampling is important, so a study of sampling and basic statistics is perhaps important
Dialogue: 0,0:08:43.81,0:08:48.02,Default,,0000,0000,0000,,now basic statistics is always important, but we have to assume that and move on
Dialogue: 0,0:08:48.02,0:08:52.97,Default,,0000,0000,0000,,now the later parts of this list are getting progressively more "yes" answers
Dialogue: 0,0:08:52.97,0:08:57.97,Default,,0000,0000,0000,,but perhaps here is a sort of mea culpa: I am a computer scientist
Dialogue: 0,0:08:57.97,0:09:01.42,Default,,0000,0000,0000,,and in particular specialize in algorithms
Dialogue: 0,0:09:01.42,0:09:09.02,Default,,0000,0000,0000,,for the last three decades perhaps, the area of algorithms focused on discrete structures
Dialogue: 0,0:09:09.02,0:09:13.72,Default,,0000,0000,0000,,graphs and various other discrete structures, data structures were a big thing
Dialogue: 0,0:09:13.72,0:09:17.05,Default,,0000,0000,0000,,if you've done computer science courses you've done lists and arrays
Dialogue: 0,0:09:17.05,0:09:20.01,Default,,0000,0000,0000,,and stacks and so on and so forth, that was a big thing
Dialogue: 0,0:09:20.01,0:09:26.08,Default,,0000,0000,0000,,now also we spent a couple of decades improving the algorithms for various discrete problems
Dialogue: 0,0:09:26.08,0:09:28.52,Default,,0000,0000,0000,,like graph problems, shortest paths and flows
Dialogue: 0,0:09:28.52,0:09:30.85,Default,,0000,0000,0000,,now these are very important going forward
Dialogue: 0,0:09:30.85,0:09:36.17,Default,,0000,0000,0000,,but we have focused enough attention on these matters for the last thirty years
Dialogue: 0,0:09:36.17,0:09:41.35,Default,,0000,0000,0000,,and learnt enough and progressed enough in some sense, which is very good
Dialogue: 0,0:09:41.35,0:09:46.34,Default,,0000,0000,0000,,but yet ignored some other parts, I would like to say, somewhat controversially
Dialogue: 0,0:09:46.34,0:09:50.86,Default,,0000,0000,0000,,we ignored some other parts in computer science, not overall in human knowledge, in computer science
Dialogue: 0,0:09:50.86,0:09:55.87,Default,,0000,0000,0000,,linear algebra and numerical methods, which have turned out to be very important
Dialogue: 0,0:09:55.87,0:10:02.38,Default,,0000,0000,0000,,so if you had asked me twenty years ago whether computing eigenvalues or eigenvectors of a matrix
Dialogue: 0,0:10:02.38,0:10:06.76,Default,,0000,0000,0000,,with a million non-zero entries is practical and important in the future
Dialogue: 0,0:10:06.76,0:10:09.58,Default,,0000,0000,0000,,I would have said no, and so would most computer scientists
Dialogue: 0,0:10:09.58,0:10:12.88,Default,,0000,0000,0000,,but in fact this is what is used in basic search
Dialogue: 0,0:10:12.88,0:10:18.94,Default,,0000,0000,0000,,so if you look at how Google does PageRank or ranks the pages on the web, it's based on eigenvalues
Dialogue: 0,0:10:18.94,0:10:21.82,Default,,0000,0000,0000,,I won't tell you how but it's based on eigenvalues
Dialogue: 0,0:10:21.82,0:10:25.92,Default,,0000,0000,0000,,so in fact it's been a surprise over the last decade or two
Dialogue: 0,0:10:25.92,0:10:29.78,Default,,0000,0000,0000,,that what we might have thought wouldn't scale up, would not be important
Dialogue: 0,0:10:29.78,0:10:31.37,Default,,0000,0000,0000,,has become important
Dialogue: 0,0:10:31.37,0:10:38.61,Default,,0000,0000,0000,,one of them is this, sampling, not basic sampling, and in fact most of this talk will be about sampling
Dialogue: 0,0:10:38.61,0:10:45.19,Default,,0000,0000,0000,,but sampling in the context of big data is going to be different from what you might think of as statistical sampling
Dialogue: 0,0:10:45.19,0:10:48.36,Default,,0000,0000,0000,,and that's important, you'll see that
Dialogue: 0,0:10:48.36,0:10:53.97,Default,,0000,0000,0000,,and also very important it turns out is high dimensional geometry, you'll see some of this
Dialogue: 0,0:10:53.97,0:11:00.45,Default,,0000,0000,0000,,so in fact even when you least expect it, even for problems which don't look like that
Dialogue: 0,0:11:00.45,0:11:04.81,Default,,0000,0000,0000,,it turns out those problems can be formulated, we'll see some of that
Dialogue: 0,0:11:04.81,0:11:10.86,Default,,0000,0000,0000,,those problems can be formulated as problems to do with let's say ten thousand dimensional vectors
Dialogue: 0,0:11:10.86,0:11:17.41,Default,,0000,0000,0000,,so what does ten thousand dimensional space look like, that turns out to be an important thing to worry about
Dialogue: 0,0:11:17.41,0:11:23.94,Default,,0000,0000,0000,,and probability limit laws, the central limit theorem, the very classical probability limit laws, they turn out to be very important
Dialogue: 0,0:11:23.94,0:11:29.65,Default,,0000,0000,0000,,now you can sort of intuitively guess why: if you have lots of data you sort of tend to the limit
Dialogue: 0,0:11:29.65,0:11:34.26,Default,,0000,0000,0000,,right? and therefore such things should be more important than let's say basic statistics
Dialogue: 0,0:11:34.26,0:11:38.49,Default,,0000,0000,0000,,and indeed in a lot of these areas we'll have to develop new tools
Dialogue: 0,0:11:38.49,0:11:44.51,Default,,0000,0000,0000,,and we have developed some which handle this big data problem, that's what I'm hoping to convince you of
Dialogue: 0,0:11:46.89,0:11:53.96,Default,,0000,0000,0000,,so sampling, so it's clear that if you have a lot of data you have to sample
Dialogue: 0,0:11:54.35,0:12:00.56,Default,,0000,0000,0000,,so that you get a subsample which is smaller and can be handled
Dialogue: 0,0:12:00.56,0:12:05.05,Default,,0000,0000,0000,,but is it just picking uniformly at random a part of the data
Dialogue: 0,0:12:05.05,0:12:08.34,Default,,0000,0000,0000,,is that going to be good enough, that's the question we want to ask, right?
Dialogue: 0,0:12:08.34,0:12:10.67,Default,,0000,0000,0000,,so we have ten million pieces of data
Dialogue: 0,0:12:10.67,0:12:13.96,Default,,0000,0000,0000,,just uniformly at random pick ten thousand of them and analyze them
Dialogue: 0,0:12:13.96,0:12:15.67,Default,,0000,0000,0000,,is that what we are going to do
Dialogue: 0,0:12:15.67,0:12:19.46,Default,,0000,0000,0000,,that's what we think of as sampling traditionally, perhaps the most elementary method
Dialogue: 0,0:12:19.46,0:12:22.73,Default,,0000,0000,0000,,but here is a problem typically that we want to solve
Dialogue: 0,0:12:22.73,0:12:27.03,Default,,0000,0000,0000,,it's an optimization problem, there are many of those, this is one
Dialogue: 0,0:12:27.03,0:12:30.31,Default,,0000,0000,0000,,you have a very large matrix A, an m by n matrix
Dialogue: 0,0:12:30.31,0:12:34.55,Default,,0000,0000,0000,,on which we want to do what's called principal component analysis
Dialogue: 0,0:12:34.55,0:12:37.31,Default,,0000,0000,0000,,if you're not familiar with this bear with me for a moment
Dialogue: 0,0:12:37.31,0:12:40.73,Default,,0000,0000,0000,,you don't have to be familiar with this at the moment, we'll see it a little more
Dialogue: 0,0:12:40.73,0:12:44.10,Default,,0000,0000,0000,,but a lot of you are familiar with this I'm sure
Dialogue: 0,0:12:44.10,0:12:51.44,Default,,0000,0000,0000,,so that just means we want to find a vector x, so x is an n vector of unit length
Dialogue: 0,0:12:51.44,0:12:55.82,Default,,0000,0000,0000,,so that when I multiply A with x and take the length, that's as high as possible
Dialogue: 0,0:12:55.82,0:12:59.48,Default,,0000,0000,0000,,so this turns out to be a basic linear algebra problem, this is very important right?
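(Editor's note: to make the objective concrete, here is a minimal Python sketch, an editor's illustration rather than anything shown in the lecture; numpy is assumed. The maximizer of the length of Ax over unit vectors x is the top right singular vector of A.)

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 50))   # m = 1000 data points in n = 50 dimensions

# Direct route: the first principal component is the top right singular
# vector of A, and max ||Ax|| over unit x is the top singular value.
_, s, Vt = np.linalg.svd(A, full_matrices=False)
x_best = Vt[0]
print(np.linalg.norm(A @ x_best), s[0])   # the two numbers coincide

# Equivalent route: power iteration on A^T A, cheap when A is sparse.
x = rng.standard_normal(50)
for _ in range(100):
    x = A.T @ (A @ x)
    x /= np.linalg.norm(x)
print(np.linalg.norm(A @ x))              # converges to s[0]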
Dialogue: 0,0:12:59.48,0:13:04.94,Default,,0000,0000,0000,,so we want to maximize over all unit length vectors x the length of Ax
Dialogue: 0,0:13:04.94,0:13:08.83,Default,,0000,0000,0000,,now A is very big so we can't do the whole problem perhaps
Dialogue: 0,0:13:08.83,0:13:11.98,Default,,0000,0000,0000,,we can't even necessarily put A into random access memory
Dialogue: 0,0:13:11.98,0:13:14.83,Default,,0000,0000,0000,,so it's going to be difficult to run traditional algorithms, what do we do
Dialogue: 0,0:13:14.83,0:13:21.37,Default,,0000,0000,0000,,somehow we'd like to say we sample A to get a sample out of it: B
Dialogue: 0,0:13:22.04,0:13:25.85,Default,,0000,0000,0000,,now B might have a smaller number of rows or columns, maybe
Dialogue: 0,0:13:25.85,0:13:28.64,Default,,0000,0000,0000,,more zeros or something that makes it easier
Dialogue: 0,0:13:29.09,0:13:31.90,Default,,0000,0000,0000,,we won't see particularly what at the moment
Dialogue: 0,0:13:31.90,0:13:34.75,Default,,0000,0000,0000,,we want that B to have this property
Dialogue: 0,0:13:35.20,0:13:38.43,Default,,0000,0000,0000,,see I don't know what x is going to be, which x I'm going to find
Dialogue: 0,0:13:38.43,0:13:42.88,Default,,0000,0000,0000,,so I better have this property, that for every... the "for every" in red is important...
Dialogue: 0,0:13:42.96,0:13:46.52,Default,,0000,0000,0000,,for every x, Ax and Bx are nearly the same length
Dialogue: 0,0:13:46.58,0:13:52.11,Default,,0000,0000,0000,,if that were true I could deal with B instead of A right? if that were true
Dialogue: 0,0:13:52.11,0:13:55.41,Default,,0000,0000,0000,,if that were not true for every x I cannot do that
Dialogue: 0,0:13:55.41,0:14:00.32,Default,,0000,0000,0000,,because I may find a bad x for which B gives me the wrong answer right?
Dialogue: 0,0:14:00.32,0:14:03.71,Default,,0000,0000,0000,,so how do I do that, that's the kind of sampling we want to do
Dialogue: 0,0:14:03.71,0:14:07.43,Default,,0000,0000,0000,,we want a sample of A that's good for every x
Dialogue: 0,0:14:07.43,0:14:11.30,Default,,0000,0000,0000,,and x is an n dimensional vector, there are infinitely many of them
Dialogue: 0,0:14:11.30,0:14:18.54,Default,,0000,0000,0000,,but as a rule of thumb it turns out, in n dimensions you can pretend there are only e to the n vectors
Dialogue: 0,0:14:18.73,0:14:22.67,Default,,0000,0000,0000,,it's not true, there are infinitely many, but you can pretend that there are e to the n vectors
Dialogue: 0,0:14:22.67,0:14:27.79,Default,,0000,0000,0000,,that's still a lot of vectors so this sampling better work for all the vectors, not just for one right?
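(Editor's note: one concrete way such a B can be built is length-squared row sampling, a classical scheme in this literature; this is an editor's sketch under that assumption, not necessarily what is on the speaker's slide. Keep a few rows of A, chosen with probability proportional to their squared lengths, and rescale.)

import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((100_000, 30))         # tall matrix, m >> n

# Keep s rows, row i drawn with probability p_i proportional to ||A_i||^2,
# each kept row rescaled by 1/sqrt(s p_i) so that E[B^T B] = A^T A.
s = 2000
p = (A * A).sum(axis=1)
p /= p.sum()
idx = rng.choice(A.shape[0], size=s, p=p)
B = A[idx] / np.sqrt(s * p[idx])[:, None]

# Spot-check the "for every x" property on a few random directions.
for _ in range(3):
    x = rng.standard_normal(30)
    x /= np.linalg.norm(x)
    print(np.linalg.norm(A @ x), np.linalg.norm(B @ x))   # nearly equal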
Dialogue: 0,0:14:27.79,0:14:32.65,Default,,0000,0000,0000,,that's one issue, here's another issue that can come up in sampling
Dialogue: 0,0:14:32.65,0:14:39.90,Default,,0000,0000,0000,,big data means that the data is perhaps too big to be stored in one machine or one unit of memory
Dialogue: 0,0:14:39.90,0:14:44.19,Default,,0000,0000,0000,,so it may be distributed among many servers
Dialogue: 0,0:14:44.19,0:14:50.24,Default,,0000,0000,0000,,now, importance sampling: uniform sampling means everybody gets drawn with the same probability
Dialogue: 0,0:14:50.24,0:14:55.54,Default,,0000,0000,0000,,importance sampling means some items have a higher probability based on their importance
Dialogue: 0,0:14:55.54,0:15:00.12,Default,,0000,0000,0000,,so importance sampling is sampling each item with probability proportional to its importance
Dialogue: 0,0:15:00.12,0:15:04.44,Default,,0000,0000,0000,,but then all servers must share the vector of importances right?
Dialogue: 0,0:15:04.44,0:15:08.31,Default,,0000,0000,0000,,they must be sampling with the same vector to be consistent
Dialogue: 0,0:15:08.31,0:15:13.08,Default,,0000,0000,0000,,now maybe the number of items is so large that this vector is costly to transmit
Dialogue: 0,0:15:13.08,0:15:17.80,Default,,0000,0000,0000,,this is the kind of problem we are going to worry about a bit more, both of these problems
Dialogue: 0,0:15:17.80,0:15:20.97,Default,,0000,0000,0000,,by the way I should tell you the talk is not exhaustive
Dialogue: 0,0:15:20.97,0:15:24.20,Default,,0000,0000,0000,,I'm not going to tell you all the problems about big data
Dialogue: 0,0:15:24.20,0:15:26.99,Default,,0000,0000,0000,,that's why we have a whole series of lectures as you can see here
Dialogue: 0,0:15:26.99,0:15:32.32,Default,,0000,0000,0000,,we'll have many more interesting lectures in the series in the coming months, one a month roughly
Dialogue: 0,0:15:32.32,0:15:37.07,Default,,0000,0000,0000,,so I'm going to cut to a slice of it and these are some of the problems we'll talk about in detail
Dialogue: 0,0:15:37.07,0:15:40.71,Default,,0000,0000,0000,,so consistency among various servers is one issue
Dialogue: 0,0:15:40.71,0:15:46.38,Default,,0000,0000,0000,,here is a very simple example that you might want to think about mentally, it's a mental exercise, a very simple one
Dialogue: 0,0:15:46.38,0:15:50.72,Default,,0000,0000,0000,,so I want the average of a set of real numbers, a large set of real numbers, a big set right?
Dialogue: 0,0:15:50.72,0:15:55.43,Default,,0000,0000,0000,,now you can just take a sample and take the average, that's an approximation
Dialogue: 0,0:15:55.43,0:16:01.57,Default,,0000,0000,0000,,but the variance may be very high, obviously right? variance is how spread out the entries are, so it might be very high
Dialogue: 0,0:16:01.57,0:16:05.15,Default,,0000,0000,0000,,so how do we achieve low relative variance
Dialogue: 0,0:16:05.15,0:16:08.26,Default,,0000,0000,0000,,now one thing to focus your mind on is the cancellation problem
Dialogue: 0,0:16:08.26,0:16:14.81,Default,,0000,0000,0000,,suppose there are lots of big positive entries and lots of big negative entries and they cancel out
Dialogue: 0,0:16:14.81,0:16:18.29,Default,,0000,0000,0000,,and the total was zero, the average was zero
Dialogue: 0,0:16:18.29,0:16:23.23,Default,,0000,0000,0000,,so if the answer is zero, if you have to make small relative error, you must get it exactly right
Dialogue: 0,0:16:23.23,0:16:25.64,Default,,0000,0000,0000,,right? if the answer is zero there is no error allowed
Dialogue: 0,0:16:25.64,0:16:29.40,Default,,0000,0000,0000,,allowing one percent error, if the answer is zero I have to get exactly the answer
Dialogue: 0,0:16:29.40,0:16:34.47,Default,,0000,0000,0000,,so how do you do that with samples, ok that's our question
Dialogue: 0,0:16:34.47,0:16:39.97,Default,,0000,0000,0000,,now for the average, the plain simple average, it's easy, you can just keep a running sum if you read the whole thing
Dialogue: 0,0:16:39.97,0:16:44.47,Default,,0000,0000,0000,,but we'll have to worry about other statistics for which this is not easy right?
Dialogue: 0,0:16:44.47,0:16:50.05,Default,,0000,0000,0000,,so the problem is that the final answer will be very small and therefore the error allowed will be very small
Dialogue: 0,0:16:50.05,0:16:52.75,Default,,0000,0000,0000,,how do you do it with samples
Dialogue: 0,0:16:52.98,0:16:57.87,Default,,0000,0000,0000,,so these are some of the problems I deal with, I'll tell you some of the problems as we go along
Dialogue: 0,0:16:57.87,0:17:04.82,Default,,0000,0000,0000,,sampling is our first major issue, the second major issue we'll deal with in this talk is communication
Dialogue: 0,0:17:04.82,0:17:12.49,Default,,0000,0000,0000,,so again first one definition, what is "big data"? what's a "big data" problem?
Dialogue: 0,0:17:12.49,0:17:16.98,Default,,0000,0000,0000,,let's simply define it as: the data does not fit into random access memory
Dialogue: 0,0:17:16.98,0:17:23.80,Default,,0000,0000,0000,,so every computer has random access memory, if the data is big enough that it doesn't fit in RAM then we call it big data
Dialogue: 0,0:17:23.80,0:17:26.30,Default,,0000,0000,0000,,that's the sort of working definition of it
Dialogue: 0,0:17:26.30,0:17:30.92,Default,,0000,0000,0000,,so let's say we run a matrix algorithm, you all know traditional matrix algorithms
Dialogue: 0,0:17:30.92,0:17:38.21,Default,,0000,0000,0000,,as the algorithm is running along, a step of the algorithm may say "bring me the (i, j) entry of the matrix"
Dialogue: 0,0:17:38.21,0:17:42.39,Default,,0000,0000,0000,,and in analyzing or implementing that we assume it can be done in unit time
Dialogue: 0,0:17:42.39,0:17:47.73,Default,,0000,0000,0000,,because if you had random access memory you can quickly access, that's what random access means
Dialogue: 0,0:17:47.73,0:17:52.46,Default,,0000,0000,0000,,you can quickly access let's say each entry of the matrix by putting down its index
Dialogue: 0,0:17:52.46,0:17:55.79,Default,,0000,0000,0000,,now this we cannot do if it doesn't fit into RAM
Dialogue: 0,0:17:55.79,0:18:01.75,Default,,0000,0000,0000,,so in fact the traditional paradigm of algorithm design assumes you can do it in unit time
Dialogue: 0,0:18:01.75,0:18:05.91,Default,,0000,0000,0000,,and that already fails, that's the primitive that fails
Dialogue: 0,0:18:05.91,0:18:13.76,Default,,0000,0000,0000,,so people devised models that avoid this assumption, one model is the streaming model
Dialogue: 0,0:18:13.76,0:18:20.88,Default,,0000,0000,0000,,the streaming model says data is written on some external memory and it's streaming, it just goes by
Dialogue: 0,0:18:20.88,0:18:23.77,Default,,0000,0000,0000,,you can only read once, from right to left
Dialogue: 0,0:18:23.77,0:18:30.17,Default,,0000,0000,0000,,I'll draw a picture here, I've been lazy, I should have drawn pictures on the slide but that takes a long time
Dialogue: 0,0:18:30.17,0:18:39.91,Default,,0000,0000,0000,,this is a stream if you will, and you are sitting here close to it reading it, you can only read once and it goes away
Dialogue: 0,0:18:39.91,0:18:46.86,Default,,0000,0000,0000,,well that per se is not good enough, there is a random access memory which is much smaller than this stream
Dialogue: 0,0:18:46.86,0:18:49.26,Default,,0000,0000,0000,,so I'll write it this way, it's very small
Dialogue: 0,0:18:49.26,0:18:55.20,Default,,0000,0000,0000,,so you can remember some of this stream, let's say by sampling you've sampled some of this stream
Dialogue: 0,0:18:55.20,0:18:57.52,Default,,0000,0000,0000,,uniform sampling is not going to do much good
Dialogue: 0,0:18:57.52,0:19:03.78,Default,,0000,0000,0000,,but there are sophisticated methods that do actually work and solve a lot of problems in this model
Dialogue: 0,0:19:03.78,0:19:12.33,Default,,0000,0000,0000,,I should say the streaming model was an object of study even before the phrase big data was coined
Dialogue: 0,0:19:12.33,0:19:18.73,Default,,0000,0000,0000,,it's been studied for the last two decades and there's quite a body of very nice results
Dialogue: 0,0:19:18.73,0:19:26.31,Default,,0000,0000,0000,,I won't tell you the entire body the way I'm structuring this talk, because I'm going to give you some more or less self-contained vignettes
Dialogue: 0,0:19:26.31,0:19:34.13,Default,,0000,0000,0000,,so we'll see one vignette from streaming data, the streaming model, but I won't be able to show you a lot of stuff right?
Dialogue: 0,0:19:35.01,0:19:38.74,Default,,0000,0000,0000,,so now where does communication come in
Dialogue: 0,0:19:38.74,0:19:50.66,Default,,0000,0000,0000,,so the random access memory you can think of as communicating information pertaining to the first part of the input to the second part
Dialogue: 0,0:19:51.02,0:19:54.34,Default,,0000,0000,0000,,so that's how the communication takes place
Dialogue: 0,0:19:56.13,0:20:00.04,Default,,0000,0000,0000,,so when I read up to this point this is all gone
Dialogue: 0,0:20:02.84,0:20:07.93,Default,,0000,0000,0000,,whatever information I need of that is stored in my random access memory
Dialogue: 0,0:20:07.93,0:20:16.62,Default,,0000,0000,0000,,so in some sense this part that has passed on is communicating some information to the second part by writing something on the random access memory
Dialogue: 0,0:20:16.62,0:20:23.97,Default,,0000,0000,0000,,that's one way to think of it, so this analogy actually relates space and communication
Dialogue: 0,0:20:23.97,0:20:29.72,Default,,0000,0000,0000,,and this has been studied widely and it is used to derive lower bounds on space
Dialogue: 0,0:20:29.72,0:20:37.40,Default,,0000,0000,0000,,ok, a subject that I won't touch very much; at a high level the concept is that somehow space is related to communication
Dialogue: 0,0:20:37.40,0:20:42.29,Default,,0000,0000,0000,,because you've got to write down what you want to communicate to the rest of the input [unintelligible word]
Dialogue: 0,0:20:42.79,0:20:50.95,Default,,0000,0000,0000,,so so far I will say it's been fairly up in the sky, but one more slide of that and then we'll come to a very particular piece
Dialogue: 0,0:20:50.95,0:20:55.04,Default,,0000,0000,0000,,so streaming is one model that's been studied a lot
Dialogue: 0,0:20:55.04,0:21:00.20,Default,,0000,0000,0000,,a little less restrictive model of big data is that data is split among many servers
Dialogue: 0,0:21:00.20,0:21:03.45,Default,,0000,0000,0000,,but the number of servers is much smaller
Dialogue: 0,0:21:03.45,0:21:14.77,Default,,0000,0000,0000,,now you might have, you could think of streaming as if each bit is sitting on a different computer, a different server
Dialogue: 0,0:21:15.87,0:21:22.31,Default,,0000,0000,0000,,that's one way to do this, the only way they communicate is through RAM right? otherwise they cannot communicate
Dialogue: 0,0:21:22.31,0:21:30.46,Default,,0000,0000,0000,,now that's, however, if the length of this stream is n, thinking of it as n servers
Dialogue: 0,0:21:32.37,0:21:36.70,Default,,0000,0000,0000,,now you can argue that a lot of real problems have data that's not so big
Dialogue: 0,0:21:36.70,0:21:41.41,Default,,0000,0000,0000,,in fact you can accommodate it on log n servers
Dialogue: 0,0:21:41.41,0:21:44.46,Default,,0000,0000,0000,,many fewer servers than the amount of data
Dialogue: 0,0:21:44.46,0:21:50.93,Default,,0000,0000,0000,,so that's a better model perhaps for a lot of data and that's one other model we'll study, we'll study both of these
Dialogue: 0,0:21:52.24,0:21:56.36,Default,,0000,0000,0000,,now again up to this point the talk has been about generalities
Dialogue: 0,0:21:56.36,0:22:01.28,Default,,0000,0000,0000,,now I want to zoom in on some examples which are self-contained examples of what is going on
Dialogue: 0,0:22:05.04,0:22:06.96,Default,,0000,0000,0000,,ok so let's start with an example
Dialogue: 0,0:22:06.96,0:22:11.27,Default,,0000,0000,0000,,this actually was a starting example and motivated a lot of research on streaming algorithms
Dialogue: 0,0:22:11.27,0:22:17.56,Default,,0000,0000,0000,,we have a large network with lots of computers sending messages to each other
Dialogue: 0,0:22:18.44,0:22:21.03,Default,,0000,0000,0000,,there are routers which route the messages
Dialogue: 0,0:22:21.03,0:22:27.40,Default,,0000,0000,0000,,so there are n computers, n is very large, perhaps a million
Dialogue: 0,0:22:28.24,0:22:32.94,Default,,0000,0000,0000,,but there are routers, which are the square boxes, maybe there are only ten of them or a hundred of them
Dialogue: 0,0:22:32.94,0:22:40.84,Default,,0000,0000,0000,,and so a message from here to here might go through this router and they might send it to this and they might send it to that
Dialogue: 0,0:22:40.84,0:22:46.63,Default,,0000,0000,0000,,and let's say the message finally wants to go there, it'll go there, so it goes through three routers and then goes to the end destination
Dialogue: 0,0:22:47.39,0:22:54.70,Default,,0000,0000,0000,,so we won't worry about the mechanics of routing, let's just study what happens here with the logs of the messages
Dialogue: 0,0:22:54.70,0:22:58.04,Default,,0000,0000,0000,,so the last router that sends it to the destination
Dialogue: 0,0:22:58.04,0:23:01.64,Default,,0000,0000,0000,,the last one keeps a log of the message
Dialogue: 0,0:23:01.86,0:23:09.50,Default,,0000,0000,0000,,the log just has, for us, the sending computer's name i, and the length of the message, that's the log it keeps
Dialogue: 0,0:23:10.08,0:23:18.27,Default,,0000,0000,0000,,and you want to analyze these logs, you want to do statistics on these logs to better improve the efficiency of the network that's there, that's the aim
Dialogue: 0,0:23:19.32,0:23:25.90,Default,,0000,0000,0000,,so let's say Ai is the total length of all the messages sent by computer i in one day let's say
Dialogue: 0,0:23:25.90,0:23:31.50,Default,,0000,0000,0000,,and you want statistics of Ai, the statistics are important to again optimize the network
Dialogue: 0,0:23:31.50,0:23:33.83,Default,,0000,0000,0000,,that's the problem we want to solve
Dialogue: 0,0:23:33.83,0:23:39.72,Default,,0000,0000,0000,,now statistics, one thing you want to do is perhaps the average of Ai or the total of Ai
Dialogue: 0,0:23:39.72,0:23:42.73,Default,,0000,0000,0000,,but you may want the variance, you may want the second moment
Dialogue: 0,0:23:42.73,0:23:46.50,Default,,0000,0000,0000,,so that's just 1 over n, there are n computers
Dialogue: 0,0:23:46.50,0:23:53.96,Default,,0000,0000,0000,,and you sum the Ai squared and then take the average, (1/n) times the sum over i of Ai^2, right? let's say we want to do that
Dialogue: 0,0:23:54.41,0:23:56.51,Default,,0000,0000,0000,,now it's that simple, let's see what happens
Dialogue: 0,0:23:56.51,0:23:59.57,Default,,0000,0000,0000,,the trouble is no router knows Ai in full
Dialogue: 0,0:23:59.57,0:24:06.79,Default,,0000,0000,0000,,so i was this computer, it might have sent messages which have gone through many routers
Dialogue: 0,0:24:06.79,0:24:15.46,Default,,0000,0000,0000,,and some of them went there, some of them went there, some of them went there, so nobody knows the full value of Ai right?
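(Editor's note: a tiny Python illustration of why this second moment is awkward here, an editor's sketch with made-up numbers, not from the talk: each router t holds only its partial counts a[t, i], and the squaring must happen after the routers' contributions are summed.)

import numpy as np

rng = np.random.default_rng(2)
n, T = 1_000_000, 3                  # a million computers, three routers
a = rng.poisson(2.0, size=(T, n))    # a[t, i]: traffic from computer i logged at router t

A = a.sum(axis=0)                    # Ai: per-computer totals, known in full to nobody
second_moment = (A * A).sum() / n    # the statistic we are after

# Summing each router's own local second moment is NOT the same thing,
# because the squaring has to happen after the inner sum over routers.
local_only = (a * a).sum() / n
print(second_moment, local_only)     # visibly different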
Dialogue: 0,0:24:15.93,0:24:22.76,Default,,0000,0000,0000,,so if you want to find this you seem to have to communicate all the partial Ai and that requires O(n) space
Dialogue: 0,0:24:22.76,0:24:27.87,Default,,0000,0000,0000,,because there are n computers, you have to communicate for each one the total from each router
Dialogue: 0,0:24:27.87,0:24:31.85,Default,,0000,0000,0000,,so think of the problem actually just for two routers, it's already interesting
Dialogue: 0,0:24:31.85,0:24:35.10,Default,,0000,0000,0000,,now let's get the routers' problem formulated abstractly
Dialogue: 0,0:24:35.10,0:24:38.52,Default,,0000,0000,0000,,each router has just an n vector, so this is where vectors come in
Dialogue: 0,0:24:38.52,0:24:41.34,Default,,0000,0000,0000,,there are no vectors to start with right? this is not a geometry problem
Dialogue: 0,0:24:41.34,0:24:43.91,Default,,0000,0000,0000,,but here I am going to make up vectors
Dialogue: 0,0:24:43.91,0:24:48.08,Default,,0000,0000,0000,,so each router you can think of as having an n vector
Dialogue: 0,0:24:48.08,0:24:54.41,Default,,0000,0000,0000,,and the components tell you the total length of all messages from each computer that's been logged by this router
Dialogue: 0,0:24:54.41,0:25:02.56,Default,,0000,0000,0000,,so for computer i, all the messages that were logged with this router, that total is kept there, let's say that's a vector
Dialogue: 0,0:25:02.56,0:25:08.56,Default,,0000,0000,0000,,so again vectors have come in surreptitiously, they have nothing to do with the problem but we've formulated the vectors
Dialogue: 0,0:25:08.56,0:25:17.71,Default,,0000,0000,0000,,we'll see the advantage of that, it's not just a bookkeeping device that I put down vectors for keeping track of this, it's more than a bookkeeping device
Dialogue: 0,0:25:17.71,0:25:23.28,Default,,0000,0000,0000,,so what we want to find, alas, has the inner sum over t
Dialogue: 0,0:25:23.28,0:25:30.69,Default,,0000,0000,0000,,so we want to take the sum over all routers of the total length of messages logged by that router coming from i
Dialogue: 0,0:25:30.69,0:25:35.17,Default,,0000,0000,0000,,I want to sum inside first, then square and then add
Dialogue: 0,0:25:35.17,0:25:37.98,Default,,0000,0000,0000,,it would have been nice if the sum was outside
Dialogue: 0,0:25:37.98,0:25:46.83,Default,,0000,0000,0000,,the trouble is the sum is inside, so you have to first total over all routers for each i, square, and then sum over all i right?
Dialogue: 0,0:25:46.83,0:25:51.08,Default,,0000,0000,0000,,and that's a problem because this information is not contained in any one router
Dialogue: 0,0:25:51.08,0:25:53.79,Default,,0000,0000,0000,,now if you don't follow the exact mechanics that's fine
Dialogue: 0,0:25:53.79,0:25:57.26,Default,,0000,0000,0000,,I'm going to abstract it even more, that makes life a lot simpler
Dialogue: 0,0:25:57.26,0:26:00.05,Default,,0000,0000,0000,,so here's a vector problem, this is the abstract problem we want to solve
Dialogue: 0,0:26:00.05,0:26:04.20,Default,,0000,0000,0000,,there's an n component vector, bold face is vectors
Dialogue: 0,0:26:04.21,0:26:06.65,Default,,0000,0000,0000,,a sub t, residing in server t
Dialogue: 0,0:26:06.70,0:26:13.08,Default,,0000,0000,0000,,and you want to take the sum and you want to estimate the squared length of the sum right? the sum of squares of the components
Dialogue: 0,0:26:13.24,0:26:16.82,Default,,0000,0000,0000,,we want to sum the vectors component wise and then take the sum of squares
Dialogue: 0,0:26:17.07,0:26:20.36,Default,,0000,0000,0000,,now if we had the whole vector and we could write it down this is a trivial problem
Dialogue: 0,0:26:20.36,0:26:28.61,Default,,0000,0000,0000,,but we don't want to write down the whole vector, so we want to use very little communication compared to the number of components, how do we do that
Dialogue: 0,0:26:28.61,0:26:35.46,Default,,0000,0000,0000,,ok so maybe I can sample k of the n components of the vector v right?
Dialogue: 0,0:26:35.46,0:26:38.62,Default,,0000,0000,0000,,v has too many components, I don't want to write them all down
Dialogue: 0,0:26:38.62,0:26:41.73,Default,,0000,0000,0000,,maybe I can sample k and just collect information on that
Dialogue: 0,0:26:41.73,0:26:45.65,Default,,0000,0000,0000,,is that any good, and the answer is a gentle no, the variance can be very high
Dialogue: 0,0:26:45.65,0:26:51.14,Default,,0000,0000,0000,,some components can be much higher than other components so this will be a loss if you do that
Dialogue: 0,0:26:51.14,0:26:57.69,Default,,0000,0000,0000,,well here is a beautiful theorem which is a geometry theorem, now you see that geometry is necessary and useful
Dialogue: 0,0:26:57.69,0:27:00.50,Default,,0000,0000,0000,,called the Johnson-Lindenstrauss Theorem
Dialogue: 0,0:27:00.50,0:27:05.40,Default,,0000,0000,0000,,if you perhaps don't remember anything else from this talk hopefully you can take this home if you didn't see it before
Dialogue: 0,0:27:05.40,0:27:07.75,Default,,0000,0000,0000,,so this theorem says the following
Dialogue: 0,0:27:07.75,0:27:15.88,Default,,0000,0000,0000,,if I pick a matrix R which has a small number of rows, k is supposed to be small
Dialogue: 0,0:27:15.88,0:27:22.81,Default,,0000,0000,0000,,a matrix with independent random unit vectors, so they're independent random vectors with unit length as rows
Dialogue: 0,0:27:22.86,0:27:31.11,Default,,0000,0000,0000,,then I take the length of R times v and that tells me the length of v, but there is a little scale factor here, don't worry about that
Dialogue: 0,0:27:31.11,0:27:34.11,Default,,0000,0000,0000,,which is a known factor, we know this so we don't care
Dialogue: 0,0:27:34.46,0:27:39.50,Default,,0000,0000,0000,,it tells me the length of v, so a little schematic picture here
Dialogue: 0,0:27:42.54,0:27:46.61,Default,,0000,0000,0000,,so we had a vector x with a lot of components, n components
Dialogue: 0,0:27:47.02,0:27:52.47,Default,,0000,0000,0000,,and we multiply it by an R which has a very small number of rows
Dialogue: 0,0:27:52.47,0:27:57.13,Default,,0000,0000,0000,,it becomes now a k vector, this is R x
Dialogue: 0,0:27:58.35,0:28:01.69,Default,,0000,0000,0000,,and it's enough to find the length of that right?
Dialogue: 0,0:28:01.69,0:28:09.57,Default,,0000,0000,0000,,so what we did was, again we had to find the length of some n things squared, instead this says a sample of k things will do
Dialogue: 0,0:28:09.57,0:28:12.33,Default,,0000,0000,0000,,not a plain sample, that won't do right?
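(Editor's note: a sketch of the theorem in Python, assuming numpy; the lecture's version uses random unit rows with a known scale factor, while this common variant uses i.i.d. N(0, 1/k) entries so that the scale factor is 1.)

import numpy as np

rng = np.random.default_rng(3)
n, k = 20_000, 400
v = rng.poisson(2.0, n).astype(float)

# Johnson-Lindenstrauss style sketch: a k x n matrix of N(0, 1/k) entries,
# so that E ||R v||^2 = ||v||^2 and the relative error is about 1/sqrt(k).
R = rng.standard_normal((k, n)) / np.sqrt(k)
print(np.linalg.norm(R @ v), np.linalg.norm(v))   # within a few percent

# Linearity is what will save communication below: if v = a1 + a2 + a3
# sits on three servers, each can ship only its k-number sketch R @ a_t.
parts = [rng.poisson(2.0, n).astype(float) for _ in range(3)]
sketches = [R @ a for a in parts]
print(np.linalg.norm(sum(sketches)), np.linalg.norm(sum(parts)))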
Dialogue: 0,0:28:12.33,0:28:17.12,Default,,0000,0000,0000,,but this kind of random vector then multiplied it will do Dialogue: 0,0:28:18.19,0:28:22.36,Default,,0000,0000,0000,,the proof is not difficult I mean obviously I won't do the proof here but the proof is not difficult Dialogue: 0,0:28:22.36,0:28:27.53,Default,,0000,0000,0000,,but what it says is that a sample of k components, plain components won't do Dialogue: 0,0:28:27.53,0:28:30.76,Default,,0000,0000,0000,,but a sample of k combinations of components will do Dialogue: 0,0:28:37.19,0:28:47.91,Default,,0000,0000,0000,,so now the trouble is this if all servers happen to have the same random matrix R they can do their own R times a t and send these vectors Dialogue: 0,0:28:47.91,0:28:51.59,Default,,0000,0000,0000,,now these are only k vectors they are very small ok? Dialogue: 0,0:28:51.59,0:29:00.69,Default,,0000,0000,0000,,you can send these vectors over to a central processor which can sum the vectors and since things are linear it can find R times v and compute the length Dialogue: 0,0:29:00.69,0:29:06.51,Default,,0000,0000,0000,,but you need to know the same R, everybody needs the same R ok? Dialogue: 0,0:29:06.51,0:29:11.44,Default,,0000,0000,0000,,how do we do that, we need a lot of communication it seems to agree on one R Dialogue: 0,0:29:11.44,0:29:17.87,Default,,0000,0000,0000,,and this is going to be this has been a simple problem and I want to isolate this problem now, forget the vector problem Dialogue: 0,0:29:17.87,0:29:22.58,Default,,0000,0000,0000,,how to share randomness this is a very important problem it turns out in many areas Dialogue: 0,0:29:22.58,0:29:31.69,Default,,0000,0000,0000,,how can many servers share the same they have to agree on the same n bit random string without transmitting random R right? Dialogue: 0,0:29:31.69,0:29:35.28,Default,,0000,0000,0000,,they all have to have exactly the same random string how do they do that Dialogue: 0,0:29:38.12,0:29:42.30,Default,,0000,0000,0000,,the string has n bits of information so it seems to need n bits of communication Dialogue: 0,0:29:42.60,0:29:51.67,Default,,0000,0000,0000,,we'll see how to get around that but for many applications of this problem as well as other problems in distributed computing, complexity and cryptography Dialogue: 0,0:29:51.67,0:29:57.92,Default,,0000,0000,0000,,we don't need all the components to be independent we need only what's called 4-way independence Dialogue: 0,0:29:57.92,0:30:03.12,Default,,0000,0000,0000,,now 4-way independence is every set of 4 or fewer bits are independent Dialogue: 0,0:30:03.37,0:30:07.44,Default,,0000,0000,0000,,so full independence is the entire thing should be independent mutually Dialogue: 0,0:30:07.44,0:30:15.23,Default,,0000,0000,0000,,but we only need every collection of 4 bits or 3 bits or 2 bits or 1 bit to be independent, that's enough it turns out Dialogue: 0,0:30:15.23,0:30:26.38,Default,,0000,0000,0000,,now 4-way independent things are easier to get coding theory, and coding theory gives us this Dialogue: 0,0:30:26.60,0:30:28.57,Default,,0000,0000,0000,,it's actually a very classic code Dialogue: 0,0:30:31.88,0:30:36.77,Default,,0000,0000,0000,,so here's one way to get the 4-way independent stream very long stream but a 4-way independent Dialogue: 0,0:30:36.77,0:30:50.42,Default,,0000,0000,0000,,I pick at random coefficients, 4 coefficients and form a degree 3 polynomial, a not plus a 1 x plus a 2 x squared plus a 3 x cubed it's a real polynomial right? 
just the polynomial of degree 3 Dialogue: 0,0:30:50.42,0:30:56.82,Default,,0000,0000,0000,,then I evaluate this polynomial at n points, p at 1, p at 2, and p at n, and so on Dialogue: 0,0:30:56.82,0:31:02.62,Default,,0000,0000,0000,,what happens is that this string which is much longer than 4, right? same length, is 4-way independent Dialogue: 0,0:31:02.62,0:31:06.01,Default,,0000,0000,0000,,and you have to do it over a finite field but this is the idea Dialogue: 0,0:31:06.01,0:31:14.41,Default,,0000,0000,0000,,there are four degrees of freedom right? four degrees of independence, a 1, a not, a 1, a 3, a 4, and that suffices to give 4-way independence Dialogue: 0,0:31:14.41,0:31:20.72,Default,,0000,0000,0000,,now these are called pseudo-random sequences and many of you will notice now the answer for the communication problem is simple Dialogue: 0,0:31:20.72,0:31:26.56,Default,,0000,0000,0000,,you just generate the seed, some central location, and transmit the seed Dialogue: 0,0:31:26.56,0:31:30.44,Default,,0000,0000,0000,,a not, a 1, a 2, a 3, centrally and transmit to all servers Dialogue: 0,0:31:30.44,0:31:36.88,Default,,0000,0000,0000,,they can individually find the string, the strings will all be the same right? it's the same string so they'll all be the same Dialogue: 0,0:31:36.88,0:31:43.08,Default,,0000,0000,0000,,so they've agreed on an n length random string with just very little transmission Dialogue: 0,0:31:43.52,0:31:48.31,Default,,0000,0000,0000,,and that the communicating randomness like this we've said is an important derivative Dialogue: 0,0:31:50.54,0:31:56.56,Default,,0000,0000,0000,,so finishing the vector problem as I've said there was an important paper due to Alon Szegedy about fifteen years back Dialogue: 0,0:31:56.56,0:32:02.40,Default,,0000,0000,0000,,which started the field of streaming algorithms so I want to go through what we have more or less summarizing what we have Dialogue: 0,0:32:02.40,0:32:10.06,Default,,0000,0000,0000,,there's a central processor which generates a seed for the random matrix, it transmits the seeds to all the servers Dialogue: 0,0:32:10.06,0:32:16.92,Default,,0000,0000,0000,,so let's just find, apply R to their individual so they build up the whole matrix from the seed Dialogue: 0,0:32:16.92,0:32:25.43,Default,,0000,0000,0000,,apply this and then send it to the central processor which sums up all these things and then finds the length Dialogue: 0,0:32:26.42,0:32:35.62,Default,,0000,0000,0000,,if server t did not have it explicitly, so it's possible that server t did not have it explicitly but we had only the streaming model Dialogue: 0,0:32:36.61,0:32:46.84,Default,,0000,0000,0000,,it only has a log of every entry but the logs are in arbitrary order, they're not sorted, so they're not sorted altogether for one i Dialogue: 0,0:32:46.84,0:33:01.04,Default,,0000,0000,0000,,but whenever I see that the i source has sent a message of length a then a little thought will tell you that all I have to do to R times a t is update it by adding a times column i of R to R times a t Dialogue: 0,0:33:01.04,0:33:12.91,Default,,0000,0000,0000,,again it's not important to remember the details but the point is whatever order I give these logs I can update my matrix vector product and you can do this in the streaming model once I have the seeds Dialogue: 0,0:33:15.48,0:33:37.14,Default,,0000,0000,0000,,now that [someone asks a question, I can't hear] so v is not available locally [audience speaks again] we have only these a t, v is the sume of what's available 
locally so we use the vector sum of vectors available at each server Dialogue: 0,0:33:37.14,0:33:49.67,Default,,0000,0000,0000,,and we want to be able to estimate the length of the sum rather than each individual one, ask me other questions that you have we can go slow Dialogue: 0,0:33:49.67,0:33:52.98,Default,,0000,0000,0000,,[someone may be asking a question, I can't hear anything just a long pause] Dialogue: 0,0:33:56.66,0:34:10.02,Default,,0000,0000,0000,,I'm talking about scenarios where each server has a certain amount of information but we want the statistics of the total and we are not allowed to just transmit the whole vector that's too costly Dialogue: 0,0:34:13.74,0:34:17.30,Default,,0000,0000,0000,,briefly how to [say one/save on?] communication Dialogue: 0,0:34:21.59,0:34:32.80,Default,,0000,0000,0000,,so I want to study something called higher order correlations. let's say you have time-series data of n events and over T times because n and T are supposed to be large this is again large data Dialogue: 0,0:34:32.99,0:34:44.33,Default,,0000,0000,0000,,so Aij is 1 over 0 depending on whether event Ei happened at time t j, so whether the [IT?] event happened at time j if so then Aij is 1 over 0 Dialogue: 0,0:34:44.50,0:34:52.85,Default,,0000,0000,0000,,so this is time series right? time series just means for each event I have a bit, I have a long string of zeros and ones which tells you when the event happened Dialogue: 0,0:34:52.85,0:35:05.42,Default,,0000,0000,0000,,now pairwise correlations wants to know how many pairs of events co-occured, so you want triples Ei, E1, E2, t, so that E1 and E2 both occur at time t Dialogue: 0,0:35:05.42,0:35:14.89,Default,,0000,0000,0000,,so if E1 and E2 both co-occur at time t, how many times do they co-occur? add them up, this is clearly of interest. I won't motivate it but I'll give you some examples in a minute Dialogue: 0,0:35:14.89,0:35:29.85,Default,,0000,0000,0000,,but it turns out higher order correlations are also of interest, we have four events E1 through E4, and t times three time steps, and we want to know how many such things where all four occurred at all these three times right? Dialogue: 0,0:35:29.85,0:35:45.17,Default,,0000,0000,0000,,it turns out there is something actually [national or natural?] motivations which I won't describe in detail, these higher order correlations are important in a lot of contexts one is neuron styling but uh we'll see a couple of other examples on the slide Dialogue: 0,0:35:45.17,0:35:59.99,Default,,0000,0000,0000,,now big data again means that events are split up on servers, so no server knows all the events, each server knows some of the events. so each server knows some of these time series, nobody knows all of them. Dialogue: 0,0:36:01.62,0:36:18.07,Default,,0000,0000,0000,,another example of this is from commercial data right? customers buying products, each customer is on one server and you want to know for instance how many triples of customers and products are there so that all three bought quite a bit of these two Dialogue: 0,0:36:18.07,0:36:30.54,Default,,0000,0000,0000,,so these kind so of analysis are necessary for instance they optimize what products should be put where and so on and so forth. 
again the details of what the exact problem is not important but it's this correlation that we want to know Dialogue: 0,0:36:33.84,0:36:45.95,Default,,0000,0000,0000,,so documents and terms in an information retrieval are co-occurrence of terms trigrams or triples of terms, whether they co-occur and how many times they co-occur are important, that's a very exact similar problem to this Dialogue: 0,0:36:48.27,0:36:54.39,Default,,0000,0000,0000,,here's the abstract formulation, you have a bipartite graph so i can draw it on the board Dialogue: 0,0:37:04.71,0:37:23.06,Default,,0000,0000,0000,,so one side we have events, the other side we have time steps, and you put an arrow if it occurred. so some things occurred some things didn't. Dialogue: 0,0:37:26.70,0:37:35.86,Default,,0000,0000,0000,,this whole graph is not presented on any server. servers only know for one particular event what are all the time steps that it happened, perhaps these two. Dialogue: 0,0:37:37.27,0:37:47.30,Default,,0000,0000,0000,,and some other server, maybe the elliptic server might know some of the other stuff, the square server might know some of this, that's the situation Dialogue: 0,0:37:51.78,0:38:05.22,Default,,0000,0000,0000,,so uh and you can think of this ok here is a concrete version of this problem, estimate the number of (2, 4) sub-graphs with each of the two left vertices connected to at least three of the four right vertices Dialogue: 0,0:38:05.22,0:38:17.09,Default,,0000,0000,0000,,so it's again not very important exactly what happens but here's a graph with two vertices on the left and four on the left, four on the right, and you want to study the connections Dialogue: 0,0:38:17.09,0:38:26.34,Default,,0000,0000,0000,,you want to do the averages of this over all sets of two and four right? that's total you want the statistics of it, and the point is the data is split Dialogue: 0,0:38:27.12,0:38:38.35,Default,,0000,0000,0000,,so you can formulate this as a bipartite graph with left and right vertices, it turns out this problem, and i'm going to go over this quickly if you don't follow it don't worry about it Dialogue: 0,0:38:38.35,0:38:48.01,Default,,0000,0000,0000,,this problem is more or less exactly the vectors problem for reasons that should become clear, so it's a vector problem in a very high number of dimensions Dialogue: 0,0:38:48.01,0:38:54.88,Default,,0000,0000,0000,,so if there are 10 right vertices you form a 10 to the 4 component vector v and it's that vector's length we want. Dialogue: 0,0:38:54.88,0:39:03.82,Default,,0000,0000,0000,,so component v i is the number of left vertices connected to at least three of the four right vertices, again you need not follow exactly all the details but it's very simple Dialogue: 0,0:39:03.82,0:39:18.45,Default,,0000,0000,0000,,and the answer we want is more or less the squared length of v, it's the same problem that we had earlier but in a very high dimensional space, so if there are n vertices here it will be n [choose?] 
Dialogue: 0,0:39:18.45,0:39:23.51,Default,,0000,0000,0000,,so this is an example where the number of dimensions is even bigger than what you started with Dialogue: 0,0:39:24.63,0:39:35.51,Default,,0000,0000,0000,,now it turns out that can be computed by the previous algorithm, but in the neuron firing case it turns out two is not enough, you want higher order correlations than two Dialogue: 0,0:39:35.51,0:39:45.27,Default,,0000,0000,0000,,so then uh sorry, so what we did is only for two, what do we do for higher order? now I'm going to go over this quickly but in the abstract setting now Dialogue: 0,0:39:45.27,0:39:54.98,Default,,0000,0000,0000,,we have a vector v with n non-negative components, which is actually not in one place, it's a sum of vectors on different servers, again the setup is the same as before Dialogue: 0,0:39:54.98,0:40:04.70,Default,,0000,0000,0000,,but now we want to estimate the k-th moment, we want the sum of the k-th powers, k is larger than two perhaps, k is three, four, k is something else, how do we do that? Dialogue: 0,0:40:04.70,0:40:15.67,Default,,0000,0000,0000,,the methods for two don't apply, this beautiful theorem doesn't work any more, it turns out that sort of theorem only works for two, no more for anything higher than two Dialogue: 0,0:40:15.67,0:40:32.99,Default,,0000,0000,0000,,but in a recent paper we showed that it can be done with not too much communication, it turns out to be enough for each server to do its own importance sampling according to its own vector's components raised to the k-th power Dialogue: 0,0:40:32.99,0:40:44.37,Default,,0000,0000,0000,,so I would have liked to draw samples according to the sum vector raised to the k-th power; it suffices to do it individually at each server and then exchange a bit of communication, Dialogue: 0,0:40:44.37,0:40:54.47,Default,,0000,0000,0000,,this is not trivial, but I won't tell you the details, but importance sampling per server is enough, that's the point of the story Dialogue: 0,0:40:54.47,0:41:10.04,Default,,0000,0000,0000,,I want to go to the next topic, perhaps if there are any questions briefly we can, I don't know whether there are any, if I've gone fast enough that maybe you're all lost which is fine then [audience laughs] any questions? Dialogue: 0,0:41:12.70,0:41:24.29,Default,,0000,0000,0000,,[someone asks a question, unintelligible] so importance sampling: uniform sampling is every item is uniformly likely to be picked, importance sampling means the probabilities are not uniform, that's all Dialogue: 0,0:41:24.79,0:41:36.24,Default,,0000,0000,0000,,[another question, unintelligible] how do you pick the importance? in this case it's proportional, in the previous slide it was proportional to the k-th power of vi Dialogue: 0,0:41:37.01,0:41:43.30,Default,,0000,0000,0000,,that is what we'd like to do but we don't have the vector v, so instead it turns out each server uses its own Dialogue: 0,0:41:43.85,0:42:01.80,Default,,0000,0000,0000,,[unintelligible question] why have many servers? [audience clarifies] ah, it's uh maybe um because you can speed up computation, but here we are looking at data being so big that it has to be put on many servers Dialogue: 0,0:42:01.80,0:42:11.59,Default,,0000,0000,0000,,or in the network routing case there are different routers, they just log messages right? each one. if you want them all collected in one place you need a huge amount of memory, so you can't do that.
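{editor's note: the talk deliberately omits the per-server protocol; below is only a minimal single-machine Python sketch of the importance-sampling primitive for the k-th moment. sampling proportional to v[i], with the unbiased estimator v[i]^k / p[i], is an illustrative choice made here, not the paper's distribution}

import numpy as np

rng = np.random.default_rng(1)
v = rng.random(100_000)   # a non-negative vector; in the talk it is really a sum spread over servers
k = 3
exact = (v ** k).sum()

# importance sampling: draw index i with non-uniform probability p[i];
# v[i]^k / p[i] is then an unbiased estimate of the k-th moment
p = v / v.sum()
idx = rng.choice(len(v), size=2000, p=p)
estimate = (v[idx] ** k / p[idx]).mean()

print(exact, estimate)    # the average of the unbiased estimates concentrates near the exact value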
Dialogue: 0,0:42:14.02,0:42:26.84,Default,,0000,0000,0000,,anything else? [unintelligible question] yeah so the answer to that is I'm a theoretician. [audience laughs] Dialogue: 0,0:42:27.84,0:42:33.39,Default,,0000,0000,0000,,well you know this is the model in which you can. k typically wants to be small Dialogue: 0,0:42:33.39,0:42:44.61,Default,,0000,0000,0000,,[unintelligible question] the dimension, the number of rows? it's logarithmic actually, it's only logarithmic, it's much smaller than [m or n?], the Johnson-Lindenstrauss theorem gives logarithmic Dialogue: 0,0:42:45.24,0:42:53.09,Default,,0000,0000,0000,,[unintelligible comment] no I thought you meant this scale [laughter] this scale is also [unintelligible] Dialogue: 0,0:42:53.09,0:43:00.03,Default,,0000,0000,0000,,anything else? ok good. so let me quickly do principal component analysis. Dialogue: 0,0:43:00.03,0:43:10.82,Default,,0000,0000,0000,,so we have a large matrix in this case, and you can think of the matrix as having each row be a data point, there are m points in n space, n dimensions Dialogue: 0,0:43:10.82,0:43:20.43,Default,,0000,0000,0000,,the number of points is much larger than the number of dimensions, so maybe they all live in ten dimensional space and there are a million points, or two hundred dimensional space and there are a million points right? Dialogue: 0,0:43:20.43,0:43:32.63,Default,,0000,0000,0000,,so a basic concept of linear algebra, again this is recapitulating: you want to find the unit vector x which maximizes the length of Ax, that's called the first principal component Dialogue: 0,0:43:32.63,0:43:39.39,Default,,0000,0000,0000,,the unit vector y perpendicular to the first principal component maximizing this is called the second principal component, and so on Dialogue: 0,0:43:39.40,0:43:43.35,Default,,0000,0000,0000,,again a lot of you are familiar with this, if not just think of it as an optimization problem Dialogue: 0,0:43:44.96,0:43:48.20,Default,,0000,0000,0000,,very nice linear algebra theory, and for many problems you want to find these Dialogue: 0,0:43:48.82,0:43:53.48,Default,,0000,0000,0000,,now with big data the matrix A may not be stored all on one server Dialogue: 0,0:43:54.16,0:44:01.90,Default,,0000,0000,0000,,here is a simple model we can think of: there are many servers, each stores a matrix of the same dimensions, Dialogue: 0,0:44:01.90,0:44:08.99,Default,,0000,0000,0000,,so the whole dimension is there, but maybe there are a lot of zeros in each server so the data in each of them is smaller Dialogue: 0,0:44:08.100,0:44:14.78,Default,,0000,0000,0000,,and you want to deal with the sum of all the matrices, but you don't want to communicate all of that right? Dialogue: 0,0:44:14.78,0:44:22.23,Default,,0000,0000,0000,,so you want to find principal components of the sum but communicate only a tiny fraction of the whole data amongst them Dialogue: 0,0:44:25.06,0:44:32.95,Default,,0000,0000,0000,,so PCA for distributed data: server t has matrix At, you want to sum them and take the principal components Dialogue: 0,0:44:32.95,0:44:40.85,Default,,0000,0000,0000,,how do we do that? again uh I want to keep to this theme of random sampling, of Johnson-Lindenstrauss in some sense Dialogue: 0,0:44:40.85,0:44:54.14,Default,,0000,0000,0000,,we use that; that told us that if you pick a random matrix R, then for every vector x, the length of R times A times x, Dialogue: 0,0:44:54.14,0:45:05.68,Default,,0000,0000,0000,,now in the old setup we multiplied R by a vector, and well, A times x is a vector right?
so the length of R times A times x is a good estimate of the length of A times x Dialogue: 0,0:45:05.68,0:45:10.27,Default,,0000,0000,0000,,I really want the length of A times x but it's enough to do this Dialogue: 0,0:45:10.27,0:45:15.46,Default,,0000,0000,0000,,R makes it much smaller, we erased that, R makes it much smaller so we can communicate better Dialogue: 0,0:45:15.46,0:45:25.38,Default,,0000,0000,0000,,but now if this were only true for one x then it wouldn't help us; but if it were true for every x, this is the optimization problem that I first pointed out Dialogue: 0,0:45:25.38,0:45:37.20,Default,,0000,0000,0000,,if it were true for every x we could just solve the problem on R times A instead of on A right? because the x I find for R times A should also be good for A, if this kind of relation is true for every x Dialogue: 0,0:45:37.20,0:45:46.36,Default,,0000,0000,0000,,and in fact it turns out, well, the number of x's is exponential in n, but Johnson-Lindenstrauss, now this is one more step, Dialogue: 0,0:45:46.36,0:45:54.47,Default,,0000,0000,0000,,gives us a low enough probability of failure for one x that this actually works for every x simultaneously, for all the x's simultaneously Dialogue: 0,0:45:54.47,0:46:03.75,Default,,0000,0000,0000,,not only is this true for one x, it's true for every x at the same time, ok that's asking for a lot because there are many many x's, but that's given to us Dialogue: 0,0:46:03.75,0:46:13.37,Default,,0000,0000,0000,,so there's one random matrix R, so again the picture was, oh I think I have a picture on the next slide Dialogue: 0,0:46:13.67,0:46:24.34,Default,,0000,0000,0000,,there's one random matrix R so that R times A times x and A times x have similar length for every x, where the number of rows of R is not very big Dialogue: 0,0:46:24.34,0:46:36.58,Default,,0000,0000,0000,,so a picture here would make it clearer: here is a big matrix, here is R times A which has far fewer rows for the same number of columns, but it has a very nice property Dialogue: 0,0:46:36.58,0:46:44.33,Default,,0000,0000,0000,,that for any vector x that you can think of, the length of A applied to x and the length of RA applied to x are similar Dialogue: 0,0:46:44.33,0:46:53.20,Default,,0000,0000,0000,,that's quite a striking property, so anything you want to do with A you can do with R times A because all these lengths are preserved Dialogue: 0,0:46:53.20,0:47:05.92,Default,,0000,0000,0000,,so that finishes PCA, which is a very brief thing; now, the area of distributed data problems: as far as big data goes, a big part of big data is that data is distributed Dialogue: 0,0:47:05.92,0:47:13.26,Default,,0000,0000,0000,,so there are many problems that people are currently studying; how to do linear and non-linear problems, optimization problems, is an important thing in this model Dialogue: 0,0:47:13.26,0:47:17.53,Default,,0000,0000,0000,,machine learning in a distributed setting is an active area Dialogue: 0,0:47:17.53,0:47:23.73,Default,,0000,0000,0000,,there are dynamic versions of these questions you can ask, where the data is subject to updates Dialogue: 0,0:47:23.73,0:47:28.93,Default,,0000,0000,0000,,there are some papers here, I point out only one reference but there are many references here Dialogue: 0,0:47:29.62,0:47:44.50,Default,,0000,0000,0000,,ok good, I want to do one more topic, uh I picked this topic partly because it relates to this, partly because it's very beautiful recent work, and one of the co-authors is in Bangalore actually so I thought it'd be nice to do
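{editor's note: a minimal Python sketch of the distributed PCA idea just described: every server applies the same random matrix R to its piece, only the small sketches R At are communicated, and the principal component is computed from their sum. the sizes, the planted dominant direction, and the Gaussian R are illustrative assumptions}

import numpy as np

rng = np.random.default_rng(2)
m, n, k = 5000, 50, 400           # m data points, n dimensions, sketch size k << m

# three servers, each holding an m x n piece with a shared dominant direction;
# the matrix of interest is the sum of the pieces
spike = rng.standard_normal(n)
spike /= np.linalg.norm(spike)
A_parts = [rng.standard_normal((m, n)) + 4 * np.outer(rng.standard_normal(m), spike)
           for _ in range(3)]
A = sum(A_parts)

# one shared random matrix R; each server sends only R @ At (k x n numbers, not m x n)
R = rng.standard_normal((k, m)) / np.sqrt(k)
sketch = sum(R @ At for At in A_parts)     # equals R @ A

# first principal component from the sketch vs. from the full matrix
x_full = np.linalg.svd(A, full_matrices=False)[2][0]
x_sketch = np.linalg.svd(sketch, full_matrices=False)[2][0]
print(abs(x_full @ x_sketch))              # close to 1: the two directions nearly agree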
Dialogue: 0,0:47:44.50,0:47:51.78,Default,,0000,0000,0000,,so this is called sparsification, which is another twist on sampling Dialogue: 0,0:47:51.78,0:48:02.62,Default,,0000,0000,0000,,now we already saw that if you have a matrix A which is big, we can compress it by using a random matrix R in front, you get R times A right? Dialogue: 0,0:48:02.62,0:48:09.12,Default,,0000,0000,0000,,now you can think of the rows of RA as combinations of the rows of A right? each row of RA is a combination of the rows of A Dialogue: 0,0:48:10.04,0:48:28.62,Default,,0000,0000,0000,,here is a, so instead of RA, instead of combining the rows of A, is it possible to just take a sample, a subset of rows, not combinations but a subset of rows, maybe with weights, and make that do this job? Dialogue: 0,0:48:31.23,0:48:34.63,Default,,0000,0000,0000,,how about graphs? because graphs are a special case of matrices Dialogue: 0,0:48:34.63,0:48:38.62,Default,,0000,0000,0000,,so here is a picture, a description of what a graph looks like Dialogue: 0,0:48:38.62,0:48:41.50,Default,,0000,0000,0000,,now I've switched from rows to columns, excuse me Dialogue: 0,0:48:41.50,0:48:46.17,Default,,0000,0000,0000,,so here's a graph, it's represented by this edge-node incidence matrix Dialogue: 0,0:48:46.17,0:48:53.87,Default,,0000,0000,0000,,so edge 1 goes from A to B, so I put a plus one on A, minus one on B, there's vertex A and vertex B Dialogue: 0,0:48:53.87,0:49:02.97,Default,,0000,0000,0000,,similarly edge 2 goes from B to C, and that's that edge, B to C, and so on Dialogue: 0,0:49:02.97,0:49:10.65,Default,,0000,0000,0000,,I put down this matrix, and uh here's a cut which cuts the graph into two pieces Dialogue: 0,0:49:10.88,0:49:16.12,Default,,0000,0000,0000,,I look at all the edges going across the cut, these three edges now going across that cut Dialogue: 0,0:49:16.12,0:49:20.79,Default,,0000,0000,0000,,it turns out, and this is a calculation that you don't have to do at this point, Dialogue: 0,0:49:20.79,0:49:33.52,Default,,0000,0000,0000,,it turns out I can represent the cut by a vector v which puts ones on the vertices on the left bank, these two, a and d, and zeros on the vertices on the right bank, b and c, of the cut Dialogue: 0,0:49:33.52,0:49:41.71,Default,,0000,0000,0000,,and if I take the length of the vector vA squared, that's the size of the cut; it's always true but you have to prove this Dialogue: 0,0:49:41.71,0:49:47.70,Default,,0000,0000,0000,,so from this we can formulate the following problem, which is exactly the same as what I had before Dialogue: 0,0:49:47.70,0:49:55.62,Default,,0000,0000,0000,,so now I want a subset of edges, a subset B of the columns, so that the length of vB and the length of vA are close Dialogue: 0,0:49:55.62,0:49:58.63,Default,,0000,0000,0000,,I don't want a combination, I want a subset Dialogue: 0,0:49:58.86,0:50:06.08,Default,,0000,0000,0000,,so in pictures, this is for graphs: here is a graph on ten vertices which has all the ten choose two edges, so it's a lot of edges Dialogue: 0,0:50:06.41,0:50:11.99,Default,,0000,0000,0000,,if there are n vertices there are n choose two edges, that's many many edges Dialogue: 0,0:50:11.99,0:50:19.43,Default,,0000,0000,0000,,can I sample a subset of edges? so in this case I have I think fifteen edges instead of ten choose two
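{editor's note: a quick Python check of the cut identity just stated, on a hypothetical four-vertex graph; rows are vertices, columns are edges with +1 at one endpoint and -1 at the other, in the style of the board picture}

import numpy as np

# vertices a, b, c, d as rows; edges (a,b), (b,c), (c,d), (d,a), (a,c) as columns
M = np.array([
    [ 1,  0,  0, -1,  1],   # a
    [-1,  1,  0,  0,  0],   # b
    [ 0, -1,  1,  0, -1],   # c
    [ 0,  0, -1,  1,  0],   # d
])

v = np.array([1, 0, 0, 1])   # the cut: a and d on the left bank, b and c on the right

# each entry of vM is +1 or -1 on an edge that crosses the cut and 0 otherwise,
# so the squared length of vM counts the crossing edges
print(int(np.linalg.norm(v @ M) ** 2))   # 3: edges (a,b), (c,d), (a,c) cross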
Dialogue: 0,0:50:19.43,0:50:29.91,Default,,0000,0000,0000,,and perhaps I weight them, I make them thicker, so that, I want to guarantee that any cut I make Dialogue: 0,0:50:29.91,0:50:34.24,Default,,0000,0000,0000,,so there are n vertices, I can cut them in two to the n possible ways Dialogue: 0,0:50:34.24,0:50:39.23,Default,,0000,0000,0000,,any cut I make here, and the corresponding cut I make here, has the same value Dialogue: 0,0:50:39.23,0:50:47.59,Default,,0000,0000,0000,,I take the total number of edges crossing this cut, I take the total weight of edges crossing this cut, they must be the same, roughly the same Dialogue: 0,0:50:47.59,0:50:58.91,Default,,0000,0000,0000,,so again I want to sparsify the graph: choose a subset of edges, so that every cut here has roughly the same weight as the corresponding cut here Dialogue: 0,0:50:58.91,0:51:07.87,Default,,0000,0000,0000,,that's a problem we might want to solve, and um here's a problem with this setup Dialogue: 0,0:51:07.87,0:51:10.64,Default,,0000,0000,0000,,here is a pathological graph Dialogue: 0,0:51:10.64,0:51:19.26,Default,,0000,0000,0000,,if you have a graph which is very dense here and very dense here and only one edge connecting the two, this is called a dumbbell graph Dialogue: 0,0:51:19.26,0:51:23.97,Default,,0000,0000,0000,,here is a cut just cutting it into the two pieces, there is one edge crossing it Dialogue: 0,0:51:24.100,0:51:32.39,Default,,0000,0000,0000,,I'd better get this edge in my sample, otherwise I'll get a zero instead of one, that's not good, that's not good relative error right? Dialogue: 0,0:51:32.83,0:51:36.12,Default,,0000,0000,0000,,zero is not within one percent relative error of one Dialogue: 0,0:51:36.31,0:51:40.07,Default,,0000,0000,0000,,so if I want to sparsify this graph I'd better always pick that one edge Dialogue: 0,0:51:40.20,0:51:43.68,Default,,0000,0000,0000,,so uniform sampling of the edges will not do Dialogue: 0,0:51:43.87,0:51:49.22,Default,,0000,0000,0000,,I have to do something else, and a beautiful theorem of Spielman, Teng and Srivastava Dialogue: 0,0:51:49.22,0:51:51.62,Default,,0000,0000,0000,,Nikhil Srivastava is at Microsoft in Bangalore Dialogue: 0,0:51:51.63,0:51:57.34,Default,,0000,0000,0000,,uh says that uh if you have an n by m matrix where m is larger than n Dialogue: 0,0:51:57.34,0:52:04.61,Default,,0000,0000,0000,,you could, there is a probability distribution, it cannot be uniform, I have to make sure I put a high probability on this edge Dialogue: 0,0:52:04.61,0:52:15.35,Default,,0000,0000,0000,,there is a probability distribution we can put on the rows of A so that if we do IID sampling of a certain number of rows Dialogue: 0,0:52:15.35,0:52:20.44,Default,,0000,0000,0000,,only n log n rows, a small number of rows think of it, according to the distribution Dialogue: 0,0:52:20.44,0:52:24.22,Default,,0000,0000,0000,,then for every x the length of Ax and the length of Bx are roughly the same Dialogue: 0,0:52:24.22,0:52:29.55,Default,,0000,0000,0000,,and that turns out to answer the cut problem, for a reason I won't actually describe Dialogue: 0,0:52:29.55,0:52:37.34,Default,,0000,0000,0000,,and if A actually came from a graph like the last one, the sampling probabilities are proportional to the electrical resistances Dialogue: 0,0:52:37.34,0:52:44.35,Default,,0000,0000,0000,,when you view the graph as a resistor network, and in fact the sampling can be done in nearly linear time
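{editor's note: a minimal Python sketch of sampling by effective resistances in the spirit of this theorem, on a hypothetical dumbbell graph; the pseudoinverse formula for effective resistance, the sample size q, and the 1/(q p_e) reweighting are standard choices used here for illustration, not the paper's nearly-linear-time procedure}

import numpy as np
from itertools import combinations

# dumbbell: two 6-cliques joined by a single bridge edge (5, 6)
n = 12
edges = [(u, v) for u, v in combinations(range(6), 2)]
edges += [(u + 6, v + 6) for u, v in combinations(range(6), 2)]
edges += [(5, 6)]

# effective resistances from the pseudoinverse of the graph Laplacian:
# R_e = Lp[u, u] + Lp[v, v] - 2 Lp[u, v]
L = np.zeros((n, n))
for u, v in edges:
    L[u, u] += 1; L[v, v] += 1; L[u, v] -= 1; L[v, u] -= 1
Lp = np.linalg.pinv(L)
reff = np.array([Lp[u, u] + Lp[v, v] - 2 * Lp[u, v] for u, v in edges])

p = reff / reff.sum()          # sampling probabilities, far from uniform:
print(p[-1] / p[0])            # the bridge is ~3x more likely than a clique edge

# IID sampling with reweighting: each pick of edge e adds weight 1 / (q * p[e])
rng = np.random.default_rng(3)
q = 200
weights = np.zeros(len(edges))
for e in rng.choice(len(edges), size=q, p=p):
    weights[e] += 1.0 / (q * p[e])

# the dumbbell cut has true value 1 (only the bridge crosses it)
print(weights[-1])             # the sampled graph gives roughly 1, not 0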
Dialogue: 0,0:52:44.35,0:52:56.10,Default,,0000,0000,0000,,so again, I have a graph, I must sample a certain set of edges, a small set of edges, so that every cut is correctly represented, and they say we can do that provided you choose the right probabilities Dialogue: 0,0:52:56.10,0:53:02.89,Default,,0000,0000,0000,,you cannot do that if you choose uniform probabilities right? and the probabilities are proportional to electrical resistances Dialogue: 0,0:53:02.89,0:53:12.79,Default,,0000,0000,0000,,so this theorem they proved about five years back; there's been a lot of work in this area, and one of the people who's worked on it is Ramesh Hariharan Dialogue: 0,0:53:12.79,0:53:23.50,Default,,0000,0000,0000,,who is our next speaker in this series, he will talk next month, perhaps not about this but something else, uh but he is also from Bangalore and he has done quite a bit of work on this also Dialogue: 0,0:53:23.50,0:53:38.08,Default,,0000,0000,0000,,now this led to something quite unexpected, uh proved about two months ago, of which Nikhil Srivastava was again one of the co-authors Dialogue: 0,0:53:38.08,0:53:47.51,Default,,0000,0000,0000,,there was a beautiful result that settles a classic mathematics problem, and the problem is actually very important in quantum theory as well as operator theory Dialogue: 0,0:53:47.51,0:53:53.15,Default,,0000,0000,0000,,well it's one of those few things which can be stated very simply, to do with vectors Dialogue: 0,0:53:53.15,0:53:59.03,Default,,0000,0000,0000,,I won't be able to describe what connection it has to the previous theorem, but it does, it's actually connected to the previous theorem Dialogue: 0,0:53:59.03,0:54:05.84,Default,,0000,0000,0000,,so if you thought big data was only going to let you handle big data, well here is something very fundamental that came out of, Dialogue: 0,0:54:05.84,0:54:11.91,Default,,0000,0000,0000,,not necessarily staring at big data, but something that has to do with compression and sparsification Dialogue: 0,0:54:11.91,0:54:21.62,Default,,0000,0000,0000,,so the theorem here says the following: you have a finite set of vectors; this is going somewhat far afield but I want to tell you this theorem because it's a spectacular achievement right? Dialogue: 0,0:54:21.62,0:54:28.51,Default,,0000,0000,0000,,um by these people, so if you have a finite set of vectors which are in what's called isotropic position Dialogue: 0,0:54:28.51,0:54:33.11,Default,,0000,0000,0000,,isotropic position is a property of the set of vectors Dialogue: 0,0:54:33.11,0:54:43.42,Default,,0000,0000,0000,,isotropic position means you take any vector x, and the sum of the squares of x dot v, the dot products over the set, is exactly the length of x squared Dialogue: 0,0:54:43.42,0:54:55.40,Default,,0000,0000,0000,,so you may think of this as the energy of x in the direction v, so this says the energy of x added up along the directions is exactly the length squared Dialogue: 0,0:54:55.40,0:55:01.42,Default,,0000,0000,0000,,so that needs to be true, and no vector should be big, so the vectors are all small in length Dialogue: 0,0:55:01.42,0:55:08.56,Default,,0000,0000,0000,,then you can always partition this set of vectors into two sets that are each about half isotropic Dialogue: 0,0:55:08.56,0:55:19.24,Default,,0000,0000,0000,,so this was about x squared; each set is about half isotropic, so for every x I take the sum, over vectors in the first part, of the squared energies along those directions Dialogue: 0,0:55:19.24,0:55:28.93,Default,,0000,0000,0000,,that's approximately x squared over two for the first set, so it must be true for the second set also right? because the total is x squared
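{editor's note: a quick Python check of the isotropic-position definition, plus a naive random split for contrast. the construction of an isotropic set from an orthonormal-column matrix is an illustrative assumption; the result referred to in the talk is the Marcus-Spielman-Srivastava theorem, and a random split does NOT carry its guarantee}

import numpy as np

rng = np.random.default_rng(4)
m, d = 40, 5

# the rows of a matrix with orthonormal columns form a set in isotropic position:
# for every x, sum_i (x . v_i)^2 = |x|^2
V, _ = np.linalg.qr(rng.standard_normal((m, d)))

x = rng.standard_normal(d)
print(((V @ x) ** 2).sum(), x @ x)          # the two numbers agree

# the theorem: if in addition every |v_i| is small, the set splits into two sets,
# each about half isotropic, for every x simultaneously; here we only try a
# random split, which tends to be in the right ballpark but is not the theorem
half = rng.permutation(m)[: m // 2]
print(((V[half] @ x) ** 2).sum(), (x @ x) / 2)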
Dialogue: 0,0:55:28.93,0:55:34.67,Default,,0000,0000,0000,,so it turns out, again, this settles actually a very long standing problem in operator theory Dialogue: 0,0:55:34.67,0:55:40.94,Default,,0000,0000,0000,,now this has something to do with graph sparsification, in fact that was their starting point Dialogue: 0,0:55:40.94,0:55:53.36,Default,,0000,0000,0000,,this actually says that not only can you sparsify a graph, you can actually split it up into very sparse pieces Dialogue: 0,0:55:53.36,0:56:03.27,Default,,0000,0000,0000,,you can split it up into many pieces each of which is sparse; subject to some conditions, that's what this theorem ends up saying, in a way that I won't be able to completely describe Dialogue: 0,0:56:03.27,0:56:14.90,Default,,0000,0000,0000,,but I believe that's all, so I'm done, we are now onto questions if you want [audience applauds] Dialogue: 0,0:56:36.76,0:56:40.24,Default,,0000,0000,0000,,{moderator} so we're open for questions now Dialogue: 0,0:56:40.96,0:56:53.11,Default,,0000,0000,0000,,{audience} sir you are taking the [unintelligible, just one word] approximately the length of the x and the R x so [our theory? (two more words)] how the x alone with [several words unintelligible] what would be the order of that approximately Dialogue: 0,0:56:53.11,0:57:07.04,Default,,0000,0000,0000,,{Kannan} oh in terms of epsilon? so in terms of the epsilon required, the number of rows will grow as one over epsilon squared, so not too bad, and in terms of n it's only logarithmic Dialogue: 0,0:57:07.58,0:57:20.09,Default,,0000,0000,0000,,{audience} [first sentence not at mike, unintelligible] ...true for all the time that we have used computers right? why is that called big data? Dialogue: 0,0:57:20.50,0:57:29.97,Default,,0000,0000,0000,,{Kannan} no, ok so uh good point. so it's true, it has been true in practice for a long time, but in theory the paradigm we had for algorithms was: Dialogue: 0,0:57:29.97,0:57:36.49,Default,,0000,0000,0000,,I store my graph or my matrix or my input for any problem in RAM and then I can quickly access various things Dialogue: 0,0:57:36.49,0:57:43.83,Default,,0000,0000,0000,,we never, well not never, we almost never worried about access problems if data was not stored in RAM Dialogue: 0,0:57:43.83,0:57:48.44,Default,,0000,0000,0000,,we did sometimes, there is some literature, but we didn't sort of seriously worry about that Dialogue: 0,0:57:48.44,0:57:53.46,Default,,0000,0000,0000,,but big data means even in theory we'd better seriously worry about that somehow Dialogue: 0,0:57:53.46,0:57:56.22,Default,,0000,0000,0000,,{moderator} one other person Dialogue: 0,0:57:56.22,0:58:09.62,Default,,0000,0000,0000,,{audience} sampling, is there any connection theoretically to Nyquist sampling, about the minimum frequency of sampling, and in big data is there theoretical work on something [analogous?] Dialogue: 0,0:58:09.62,0:58:15.82,Default,,0000,0000,0000,,{Kannan} yeah, so you ask about the optimal sampling rate that we need; we often don't know the optimal answer Dialogue: 0,0:58:15.82,0:58:21.79,Default,,0000,0000,0000,,Nyquist
was nice enough to know the optimal answer; we often don't know the optimal answer, we know lower and upper bounds which don't quite meet Dialogue: 0,0:58:21.79,0:58:24.84,Default,,0000,0000,0000,,but in a few cases we do know the optimal answer Dialogue: 0,0:58:24.84,0:58:26.44,Default,,0000,0000,0000,,{audience} well what is that kind of \N Dialogue: 0,0:58:26.44,0:58:34.28,Default,,0000,0000,0000,,{Kannan} is there work on that kind of thing? the lower bounds are very hard to come by, that's the problem. Nyquist sampling is something else, lower bounds were also possible there Dialogue: 0,0:58:34.97,0:58:39.48,Default,,0000,0000,0000,,{audience} in your presentation you are dealing with, well, numerical data, and maybe you are not talking of Dialogue: 0,0:58:39.70,0:58:48.77,Default,,0000,0000,0000,,{Kannan} so good point, good point, throughout the talk I only focus on numerical data, partly to spite my friends who focused on Boolean variables for too long {audience laughs} Dialogue: 0,0:58:48.77,0:58:56.17,Default,,0000,0000,0000,,but partly, partly it is the following: uh machine learning in general, or other subjects, have feature vectors Dialogue: 0,0:58:56.17,0:59:03.65,Default,,0000,0000,0000,,feature vectors can have Boolean fields, but generally it's much nicer to deal with real data because there's a lot more structure Dialogue: 0,0:59:03.65,0:59:13.43,Default,,0000,0000,0000,,{audience} but some of this, uh, you know, text or even pictures or audio and all that, can they be converted into numerical form Dialogue: 0,0:59:13.43,0:59:22.07,Default,,0000,0000,0000,,{Kannan} actually that's a very good question, so for instance if you have zero-one data, so for things occurring or not occurring, zero or one Dialogue: 0,0:59:22.07,0:59:32.08,Default,,0000,0000,0000,,now the numerical distance squared between those two is of course the number of flips, or the dot product for instance, even though they are Boolean Dialogue: 0,0:59:32.08,0:59:37.46,Default,,0000,0000,0000,,the dot product is the number of common ones, so there are connections like this which can be spotted Dialogue: 0,0:59:37.46,0:59:43.86,Default,,0000,0000,0000,,now it turns out, actually it is a very good point you raised, it turns out for instance for the web matrix you have the web pages and the hypertext links Dialogue: 0,0:59:43.86,0:59:50.50,Default,,0000,0000,0000,,so uh there's a one in the (i, j) position if the i-th web page links to the j-th web page Dialogue: 0,0:59:50.50,0:59:57.100,Default,,0000,0000,0000,,now this is purely a Boolean thing, but you take that matrix and do eigenvalue analysis of it, that's how you get PageRank Dialogue: 0,0:59:57.100,1:00:04.41,Default,,0000,0000,0000,,the reason it works is distances squared and dot products make sense as if they were reals, even though they're Boolean
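{editor's note: the talk only says that eigenvalue analysis of the 0/1 link matrix gives PageRank; below is a minimal Python power-iteration sketch. the damping factor 0.85 and the toy link matrix are conventional illustrative assumptions beyond what the speaker said}

import numpy as np

# H[i, j] = 1 iff page i links to page j (purely Boolean data)
H = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)

P = H.T / H.sum(axis=1)        # column-stochastic: column i spreads page i's rank over its links
G = 0.85 * P + 0.15 / 4        # damping mixes in a uniform random jump

r = np.ones(4) / 4
for _ in range(100):           # power iteration converges to the top eigenvector of G
    r = G @ r
print(r / r.sum())             # the PageRank scores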
Dialogue: 0,1:00:15.69,1:00:23.46,Default,,0000,0000,0000,,{audience} sir I've heard a lot of big data talks, and oftentimes they're about the systems aspect of big data, and it was fascinating to hear the foundational aspects of big data Dialogue: 0,1:00:23.46,1:00:25.89,Default,,0000,0000,0000,,{Kannan} completely removed from reality, I'm sorry {audience laughs} but yeah that's true Dialogue: 0,1:00:25.89,1:00:29.84,Default,,0000,0000,0000,,{audience} sir where does theory and practice meet and how do they meet? Dialogue: 0,1:00:29.84,1:00:40.36,Default,,0000,0000,0000,,{Kannan} actually that's a good question, so for instance uh this idea of using Johnson-Lindenstrauss right? really in a way comes from theory Dialogue: 0,1:00:40.36,1:00:51.10,Default,,0000,0000,0000,,ok, so, I mean in fact, I want to claim, though I'm not even sure, I would think theoreticians were the first to think of random [projections?] because they're [unintelligible] Dialogue: 0,1:00:51.10,1:00:59.20,Default,,0000,0000,0000,,ok so we prove this theorem that says if you take a number of rows, a hundred log n over epsilon squared, then I can prove that this works Dialogue: 0,1:00:59.20,1:01:05.43,Default,,0000,0000,0000,,well in practice that's too big, but it guides practice in the sense that the whole method comes from theory Dialogue: 0,1:01:05.43,1:01:11.07,Default,,0000,0000,0000,,now in practice you can go and say I'll use only five log n over epsilon, and that works beautifully Dialogue: 0,1:01:11.07,1:01:18.93,Default,,0000,0000,0000,,so, ok, we need two things: we need empirical studies to translate this theory into practice, because the bounds may be prohibitive Dialogue: 0,1:01:18.93,1:01:24.45,Default,,0000,0000,0000,,we also need the other-way mechanism: in practice people use certain heuristics or algorithms Dialogue: 0,1:01:24.45,1:01:30.10,Default,,0000,0000,0000,,I mean people are happy using the k-means algorithm for anything they want, it always works Dialogue: 0,1:01:30.10,1:01:35.12,Default,,0000,0000,0000,,theoreticians are puzzled that it always works, and we are still trying to prove that there are cases where it works Dialogue: 0,1:01:35.12,1:01:41.25,Default,,0000,0000,0000,,but in trying to prove that maybe we come up with twists that are interesting in practice, so it's probably both ways that it would go Dialogue: 0,1:01:41.25,1:01:45.42,Default,,0000,0000,0000,,but that's how all this vector stuff came from theory in some sense
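{editor's note: a minimal Python sketch of the k-means heuristic the speaker mentions, via the standard Lloyd iterations on synthetic data; the three planted blobs, the initialization, and the iteration count are illustrative assumptions}

import numpy as np

rng = np.random.default_rng(5)

# synthetic data: three well-separated blobs in the plane
X = np.vstack([rng.normal(c, 0.5, size=(100, 2)) for c in [(0, 0), (5, 0), (0, 5)]])

k = 3
centers = X[rng.choice(len(X), k, replace=False)]   # initialize at random data points
for _ in range(20):
    # assign each point to its nearest center, then recompute each center
    labels = np.linalg.norm(X[:, None] - centers[None], axis=2).argmin(axis=1)
    centers = np.array([X[labels == j].mean(axis=0) if (labels == j).any() else centers[j]
                        for j in range(k)])
print(centers)   # close to the planted blob centers (up to relabeling)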
Dialogue: 0,1:01:46.08,1:01:52.25,Default,,0000,0000,0000,,{audience} so for the problem of principal component analysis on distributed data, what are the objectives there? Dialogue: 0,1:01:52.25,1:01:55.99,Default,,0000,0000,0000,,what are the things that you want, to compress it to within maybe a target rate? Dialogue: 0,1:01:55.99,1:02:00.86,Default,,0000,0000,0000,,and then of all possible ways, all those R matrices, all possible ways by which you can compress it to a target size, Dialogue: 0,1:02:00.86,1:02:07.32,Default,,0000,0000,0000,,so are there uh analytical lower bounds on the amount of communication that is needed and which is possible? Dialogue: 0,1:02:07.32,1:02:15.13,Default,,0000,0000,0000,,{Kannan} yes, there are lower bounds actually, and it turns out this algorithm meets the upper and lower bounds Dialogue: 0,1:02:15.13,1:02:21.03,Default,,0000,0000,0000,,this is one of these algorithms which actually is tight with the lower bounds, we're there for that problem already Dialogue: 0,1:02:21.30,1:02:31.22,Default,,0000,0000,0000,,so I can give you references, but roughly, well I won't tell you the exact lower bound, but yeah in this case the lower and upper bounds meet Dialogue: 0,1:02:32.34,1:02:36.48,Default,,0000,0000,0000,,and also for the vi to the k problem there are lower bounds that meet the upper bounds Dialogue: 0,1:02:42.78,1:02:50.31,Default,,0000,0000,0000,,{audience} this bipartite graph you drew, you say that each server has its own respective information, but how do you combine them finally together? Dialogue: 0,1:02:51.06,1:02:56.70,Default,,0000,0000,0000,,{Kannan} so that's still the importance sampling; the model, what I said, was each server does importance sampling Dialogue: 0,1:02:56.70,1:03:05.59,Default,,0000,0000,0000,,just on its own events, and somehow that's enough, it turns out, with some exchanges and rejection sampling which I didn't go into Dialogue: 0,1:03:05.59,1:03:11.40,Default,,0000,0000,0000,,that's enough to do something which is as if we did importance sampling on the whole sum Dialogue: 0,1:03:12.02,1:03:15.65,Default,,0000,0000,0000,,so there's some work to be shown there, I didn't show it Dialogue: 0,1:03:17.82,1:03:21.96,Default,,0000,0000,0000,,I'll move to the other side, no I won't move to the other side {laughter} Dialogue: 0,1:03:22.72,1:03:32.09,Default,,0000,0000,0000,,{audience} ok so on this geometric embedding of symbolic uh problems and so on: uh you know, of course you can also embed many problems in logic Dialogue: 0,1:03:32.30,1:03:37.13,Default,,0000,0000,0000,,is there a sort of a sampling theorem there that is striking, in logic? Dialogue: 0,1:03:39.11,1:03:41.15,Default,,0000,0000,0000,,{Kannan} for logic uh?\N Dialogue: 0,1:03:41.43,1:03:43.91,Default,,0000,0000,0000,,{audience} so for theorem proving or for [combine this/compactness?] Dialogue: 0,1:03:44.89,1:03:50.35,Default,,0000,0000,0000,,{Kannan} actually I don't know, uh um, the difficulty is for theorem proving you want the exact thing right? Dialogue: 0,1:03:50.35,1:03:54.48,Default,,0000,0000,0000,,so the closest thing I can think of is the programming languages people, verifying a program Dialogue: 0,1:03:54.48,1:03:58.35,Default,,0000,0000,0000,,I mean you would like a theorem that says the program is always correct, which is often difficult to prove Dialogue: 0,1:03:58.35,1:04:05.66,Default,,0000,0000,0000,,they do have sampling methods of generating random inputs on which it's enough to check, so you can assert with high probability Dialogue: 0,1:04:05.66,1:04:13.47,Default,,0000,0000,0000,,even when that's not enough, there are also ways of coming up, not full-fledged provable ways, but heuristic ways of coming up with a test set Dialogue: 0,1:04:13.47,1:04:22.33,Default,,0000,0000,0000,,{audience} yeah but there could be other things like uh finding deep resolvents or steps in theorem proving \N{Kannan} that's right Dialogue: 0,1:04:22.33,1:04:30.36,Default,,0000,0000,0000,,{audience} that could come from\N{Kannan} yeah maybe some steps can be subject to sampling, but I don't know much of anything along those lines actually Dialogue: 0,1:04:43.24,1:04:53.62,Default,,0000,0000,0000,,{audience (also first and only female voice to speak so far)} look this is really not uh um [a data science-y?] question as such, uh um think of me as a layman asking a question to, you know, a professor Dialogue: 0,1:04:53.62,1:04:58.92,Default,,0000,0000,0000,,so to what extent have all these algorithms been implemented? Dialogue: 0,1:05:00.14,1:05:07.01,Default,,0000,0000,0000,,um say I'm from a software background and I'm trying to implement these things with respect to a [field?] problem Dialogue: 0,1:05:07.30,1:05:13.14,Default,,0000,0000,0000,,to what extent are these implemented so that they're usable?
Dialogue: 0,1:05:13.36,1:05:19.100,Default,,0000,0000,0000,,{Kannan} ok uh I don't know a lot about where the streaming algorithms have been implemented, but perhaps they have been, they're around Dialogue: 0,1:05:19.100,1:05:31.24,Default,,0000,0000,0000,,the matrix algorithms, actually, people have begun to study in these distributed architectures; I know you're familiar with MapReduce, that's one particular distributed architecture Dialogue: 0,1:05:31.24,1:05:41.93,Default,,0000,0000,0000,,of how the machines interact, which Google built, and there have now been studies, maybe two or three years old, of the matrix algorithms Dialogue: 0,1:05:41.93,1:05:47.87,Default,,0000,0000,0000,,um streaming I would guess is implemented but I don't know precisely Dialogue: 0,1:05:47.87,1:05:52.56,Default,,0000,0000,0000,,uh the matrix algorithms, it's uh, you have large matrices and you want to do this Dialogue: 0,1:05:55.56,1:06:06.09,Default,,0000,0000,0000,,oh and clustering is another example which people have implemented, and k-means, I didn't talk about that; in MapReduce and other settings there has been some study Dialogue: 0,1:06:08.39,1:06:14.49,Default,,0000,0000,0000,,{moderator} just to add another question to that, actually one of the streams of activity here is to take Dialogue: 0,1:06:14.49,1:06:20.67,Default,,0000,0000,0000,,very well established algorithms like clustering, singular value decomposition, that have been around for a while Dialogue: 0,1:06:20.67,1:06:26.56,Default,,0000,0000,0000,,but put them on a sort of more firm foundational basis where you understand when they work, when they don't, Dialogue: 0,1:06:26.56,1:06:31.49,Default,,0000,0000,0000,,how you can sample them to make them faster, etcetera, that's clearly one of the streams Dialogue: 0,1:06:31.49,1:06:39.76,Default,,0000,0000,0000,,along that line I had a question, which would be: um are there, are there methods that are very well established, analogous to clustering etcetera, Dialogue: 0,1:06:39.76,1:06:48.72,Default,,0000,0000,0000,,which clearly have an impact in practice but do not have a mathematical foundation yet, which we all need to look at, examples of those? Dialogue: 0,1:06:48.72,1:06:55.67,Default,,0000,0000,0000,,{Kannan} yeah uh certainly: graphical models, belief propagation for example, there there's some theory but not enough Dialogue: 0,1:06:55.67,1:07:03.70,Default,,0000,0000,0000,,uh deep learning is a big one, so there's a subject called deep learning that um machine learning people have devised over the last maybe ten years Dialogue: 0,1:07:03.70,1:07:09.79,Default,,0000,0000,0000,,uh it is supposed to be spectacular,
no theory establishing that as such Dialogue: 0,1:07:09.79,1:07:16.31,Default,,0000,0000,0000,,topic modeling is another area, one that one of our future speakers, Chiranjib, works on, Dialogue: 0,1:07:16.31,1:07:24.33,Default,,0000,0000,0000,,another area where there are a lot of heuristic methods; we perhaps have few proofs, and uh things need to be proved along those lines, so Dialogue: 0,1:07:24.33,1:07:30.15,Default,,0000,0000,0000,,I mean there are a lot, so I'd say it goes both ways right: one question was how to take theory to practice Dialogue: 0,1:07:30.15,1:07:36.88,Default,,0000,0000,0000,,but there's the other way, that practical people, I mean they are generally happier people because whatever they use, it seems to work {audience laughter} Dialogue: 0,1:07:36.88,1:07:43.10,Default,,0000,0000,0000,,but we can't prove anything, I mean it makes us unhappy, but that might be one, another way to go also Dialogue: 0,1:07:49.18,1:07:55.74,Default,,0000,0000,0000,,{audience} there is another basic question from my side: like uh even if we go with all of these algorithms, uh Dialogue: 0,1:07:55.74,1:08:03.09,Default,,0000,0000,0000,,if I look at it from the software point of view of applications, how are you going to get the data into this form, like with the matrices? Dialogue: 0,1:08:03.09,1:08:06.23,Default,,0000,0000,0000,,so that I can implement it on the application side also? Dialogue: 0,1:08:06.23,1:08:13.43,Default,,0000,0000,0000,,{Kannan} ok so that's a question; more generally, you might have big data that you have to annotate Dialogue: 0,1:08:13.43,1:08:18.20,Default,,0000,0000,0000,,I mean that's a huge problem, in uh biology, in astronomy, and so on, there's a huge problem Dialogue: 0,1:08:18.20,1:08:22.14,Default,,0000,0000,0000,,I don't know whether there is a sort of simple solution for that Dialogue: 0,1:08:22.14,1:08:25.20,Default,,0000,0000,0000,,obviously not, because every domain has to do its [unintelligible] Dialogue: 0,1:08:25.20,1:08:30.63,Default,,0000,0000,0000,,now as far as these specific things, how to convert them into vectors: these more or less come as logs right?
Dialogue: 0,1:08:30.63,1:08:35.18,Default,,0000,0000,0000,,it was just messages, for instance source and number of packets Dialogue: 0,1:08:35.18,1:08:39.46,Default,,0000,0000,0000,,it's very easy to run through the log and build your vectors, that's what we were saying Dialogue: 0,1:08:39.46,1:08:47.45,Default,,0000,0000,0000,,so in some of these instances it is easy, but more generally the problem of how to make data not only machine readable Dialogue: 0,1:08:47.45,1:08:55.26,Default,,0000,0000,0000,,but intelligible, sort of, is a big problem for which I don't think I have a solution Dialogue: 0,1:09:00.08,1:09:02.79,Default,,0000,0000,0000,,so this is the quiet side, I can come to the quiet side and then Dialogue: 0,1:09:03.16,1:09:06.74,Default,,0000,0000,0000,,{moderator} the questions were almost all on that side, yes, the questions come from that side Dialogue: 0,1:09:06.91,1:09:08.92,Default,,0000,0000,0000,,{Kannan} unless we are done with time Dialogue: 0,1:09:09.14,1:09:12.94,Default,,0000,0000,0000,,{moderator} any other questions?\N{silence} Dialogue: 0,1:09:17.49,1:09:27.36,Default,,0000,0000,0000,,{moderator} ok, so if there are no more questions then we'll um thank Ravi again for a {applause} very impressive talk Dialogue: 0,1:09:28.33,1:09:30.27,Default,,0000,0000,0000,,{Kannan} oh thank you Dialogue: 0,1:09:33.65,1:09:35.99,Default,,0000,0000,0000,,{presenter} this is a very small gift for a very big lecture \N{laughter}\N Dialogue: 0,1:09:36.95,1:09:39.43,Default,,0000,0000,0000,,{Kannan} well the next one will say big data on it \N{laughter} Dialogue: 0,1:09:39.67,1:09:44.26,Default,,0000,0000,0000,,{off camera} I'd just like to mention, for those of you who don't know, Ravi is writing, he mentioned it at the beginning, Dialogue: 0,1:09:44.26,1:09:47.72,Default,,0000,0000,0000,,a foundational book on this topic, and uh some of it might be available on the web uh Dialogue: 0,1:09:47.72,1:09:49.48,Default,,0000,0000,0000,,{Kannan} oh it's available to download, yeah Dialogue: 0,1:09:49.48,1:09:56.83,Default,,0000,0000,0000,,{off camera} at least even for me, who's been in this community for a while, to have all of these different algorithms strung together Dialogue: 0,1:09:56.83,1:10:05.80,Default,,0000,0000,0000,,under a very nice umbrella that relates now to the, to the things in the world, is certainly very interesting, very, you know, foundational Dialogue: 0,1:10:05.80,1:10:08.47,Default,,0000,0000,0000,,so I would encourage all of you to take a look at it and uh Dialogue: 0,1:10:08.47,1:10:11.14,Default,,0000,0000,0000,,{Kannan} yeah please feel free to download it and uh use it any way Dialogue: 0,1:10:11.14,1:10:15.05,Default,,0000,0000,0000,,{off camera} thanks Ravi again and uh I'll turn it over to Professor Dialogue: 0,1:10:15.05,1:10:23.34,Default,,0000,0000,0000,,ok so we do have uh coffee and things right next door so please help yourselves there Dialogue: 0,1:10:23.34,1:10:28.74,Default,,0000,0000,0000,,and then we'll meet again in a month's time for the next lecture, and the details will be up on the web Dialogue: 0,1:10:28.74,1:10:32.30,Default,,0000,0000,0000,,{Kannan} so thanks for coming and thanks, thanks for the questions and...\N{applause}