Pages

Showing posts with label Programming Methodology-Lecture23 Instructor (Mehran Sahami). Show all posts
Showing posts with label Programming Methodology-Lecture23 Instructor (Mehran Sahami). Show all posts

Saturday, 11 June 2011

Programming Methodology-Lecture23 Instructor (Mehran Sahami)


Programming Methodology-Lecture23
Instructor (Mehran Sahami): So welcome to the 106A lecture just before
Thanksgiving. I thought attendance might not be spectacular today. People are still
filtering in, but it’s good to see so many of you here, though. I’m sure some of you have
already headed home.
You may be wondering who I am. I would have thought before we got back mid-quarter
evaluations that you stood a chance of recognizing me as the TA of your class, but the
comment of more than half of the people who responded to the question of how is Ben
doing was, “I don’t know Ben. I’ve never interacted with Ben. So I assume he’s doing a
great job.”
Most of you jumped to that conclusion. But – so I sort of chased at this, but not much. I
realize, though, that I would enjoy the opportunity to develop a little more exposure. And
so Maron and I decided to switch roles for today. In fact, he’s making copies as we speak,
and there’s going to be one handout that’s sort of pertinent to what you’ll be doing for
[inaudible] the next assignment. And he should be back any time, but I certainly can’t
blame him for being – running a little late since I was supposed to make the copies before
class myself. So anyway, in my capacity as Maron, I don’t hope to be as spontaneous or
as funny as he is, but I do hope to talk about something that is a little bit off of the sort of
charted path of CS106A’s curriculum.
So one way of describing the things that you’re learning in 106A is that you’re learning
how to take the ways that you already know of doing things that may take you some time
and maybe error-prone in the doing of them and tell a computer precisely how to do them
so that you reap the benefits of a machine being able to execute those instructions. And
they’ll be faster. There’ll be fewer errors as long as you get the instructions right in the
first place. And that’s got to be an exhilarating feeling. It’s got to be empowering, or at
least it was when I was sitting in your place as a freshman, to know that, all of a sudden,
if you have a good idea – if you can come up with a procedure for solving a problem that
you’ve always had and tell a computer how to do it, then the computer will be able to do
that over and over and over again. All right? So that’s 106A in a nutshell. So it’s sort of –
it’s not a perfect metaphor. What does it mean to run breakout by hands? I don’t know.
So what I – the distinction I’m trying to make between what you’ve been learning so far
and what you will learn, hopefully, in this lecture – and certainly if you go on to CS106B
– is that there are things that computers can do that you couldn’t possibly have simulated
by hand. Computers can handle so much more data than you could ever manage to sort
through on your own manually that it’s worth you’re learning a bit about how to make
them do that – how to think about instructing a computer to do something that you
couldn’t possibly have done. So today, we’re going to talk about two of the most
important kinds of those problems in computer science, and I’m going to stop just talking
at you and show you some examples in a second. But those two topics are searching and
sorting. How do you find something in a set of things that you have? And how, if you
could impose some structure on that set, would you both find it more quickly, and – I
guess I got ahead of myself – how would you go about imposing the sort of structure that
would help you find it more quickly?
All right. So if I’m seeming to ramble, that stops now.
So searching and sorting – I say they are important, and I hope you believe me, and you
think that for that reason, it’s worth talking about for an entire lecture. But they really are
just a way of getting to talk about something a little deeper, which is this concept of
algorithmic efficiency that we haven’t said much about in the class so far. So that’s the
third part of this two-part lecture. What’s the deal with searching? Well, searching
doesn’t quite fit the mold as something you couldn’t do by hand. You find things all the
time. Chapter 12 of the book looks at two operations of the searching and sorting. This is
pretty generic. All right. So searching is simpler. You can define a search problem – say
you have an array or some other collection of things, and you have something you want
to find. The typical way this is done is that you want to find the index into the array –
where that element was – or, in a more generic case, you want to find out if that element
is even in the array. So you want to have some way of determining that it wasn’t actually
found.
And that may be all you care about. It may be the case that, if the search routine returns
anything other than negative one, you don’t actually care what its value was. But we
adopt – I will adopt the convention for this lecture that if you don’t find what you’re
searching for, then the method that you wrote to do the search should return a negative
value since that’s not a valid index into the array. All right. So if you have a set of things
that you’re trying to search through, the easiest way of doing that is just to look at all of
them and see if each one, in turn, is what you’re looking for. That’s called a linear search
because, to sort of pre-figure what we’re going to talk about at the end of the lecture, the
number of operations you have to do to find the element that you’re looking for is
linearly proportional to the number of elements that you had to begin with. Now it may
not be obvious right now that there’s something better that you can do, and with an
arbitrary set of data, there is not anything better that you can do. But this procedure that
I’ve written up here is an implementation of what you already could have written – a
procedure for finding an element in an array of integers.
So there’s nothing very tricky about this. If it had been a string, you might have even
avoided having to write the function and called it something like, “Index of” with a
character. And inside of “Index of”, though you don’t have to write the code, it would
have had a four loop that iterated over each of the characters of the string, tested each one
for quality with the key that you were looking for and then, presuming it found one that
equaled that character – or in this case, that integer – it would return the index of it. And
if the four loop completes, then you must not have examined any elements that matched,
and so you return negative one. Okay? So here’s a simulation of how that would work,
although the – leaving nothing to the imagination here, this is pretty easy to sort of
envision.
So we have this array of primes, and it has 17 in it. So we are going to expect to find that
this linear search shouldn’t return negative one. But the second one, we’re looking for 27
– which sort of looks prime, but of course, it’s not because it’s nine times three. All right?
Okay. So we called linear search, and here’s our new stat frame, which has the local
variables I and the parameters P and array in it. This is sort of Eric Robert’s – who wrote
the textbooks – way of displaying the execution model of JAVA. All right? So we’re just
looping over it, and we’re testing as I equals zero and then one and then two and then
three and four, five, six – where should we find it? Well, we were looking for seven – no.
Right now, we’re looking for 27. So we’re just gonna go to the – I think there was a
problem with the way that – well, anyway. The console here prints out that, indeed, we
found 17 at position six. There may have been a problem converting PowerPoint slides to
Keynote slides, so I apologize for the sort of shakiness of these animations. But the
content of the slides should be adequate. Cool.
All right. So now we’re in our second called linear search, and the only thing that’s
changed at this point from the beginning of the first called was that the key is different.
So I is still gonna start at zero, and it’s still gonna iterate over all of the positions of the
array. But we’re gonna go through the entire array, which means going all the way up to
ten where I is no longer less than ten. And we didn’t find 27 anywhere. So I hope this is
all fairly clear. But these are the results that you would have expected. So we found the
position of 17, and then we looked for 27 but didn’t find it and returned negative one.
Cool.
How many people think they could have written that in their sleep? That that was too
much time to spend on such a simple idea? All right. Well, it gets more interesting from
here on out, but – so talk to me. What is the problem with linear search?
Student: It takes a lot of time.
Instructor (Mehran Sahami): It takes a lot of time, right? It takes as much time as you
have elements. So if I asked a question, which was – if I knew the SUID number of one
of you in the room, and I asked, “Is this your SUID number?
And you say, “No.”
And I ask you, “Is 05179164 your SUID number?”
Is it yours? Be kind. Tell me it’s yours.
Student: [Inaudible].
Instructor (Mehran Sahami): It’s not. So it actually turns out that it’s mine, so I would
have had to search the entire room before I finally got back to myself if I didn’t make that
shortcut. Okay.
So another illustration of what is really a very simple idea – that if you’re forced to look
at all of the elements that you are searching among, then that’s bad in and of itself if you
could do better. So we’ll see if we can do better, but here is a rather larger data set – 286
area codes – the three-digit numbers that come before the rest of the seven digits in your
phone numbers.
But we’re trying to find 650, which is this county’s area code. So here’s a larger array.
What would happen if we were searching in this? Well, it depends. If we have a fast
computer, it may not matter. These are – there’s only about 300 numbers here, and doing
something 300 times is something that computers are reasonably good at.
But I thought I’d put together this sort of compelling example of why this can become a
bad idea.
All right. So I wrote a program in JAVA, which has this implementation of linear search
that we were just looking at. And the difference that you may notice is that I have an
instance variable, which is sort of behaving as though it were an array, and you’ll see
why that is in a second. But all that I need it for is to get values from it at certain
positions, and I wanna know what its length is so that I can iterate over it.
And I’m also – so why is it called DIS? It’s short for display, and it’s this – an instance of
this number display class that I’ve written. And if there’s time and interest, I’ll talk about
how that was implemented. But it’s intended to sort of mimic the behavior of an array
and, at the same time, show you a linear search in action.
All right. So this is a dialogue program that’s just reading line to ask me for some input.
And here are all of those numbers.
So I slowed it down – even if this looks like it’s going relatively fast – so that you could
see it sequentially considering each of these numbers. All right?
You may notice that you can – you could probably beat this. Right? It’s going slowly
enough that, since these numbers are already in sorted order, you can jump ahead.
So what is it – I want you to think about – while you’re waiting on this to finish, I want
you to think about what it is that you’re doing when you search for the number in this list
of numbers that happens to be sorted, which allows you to go so much more quickly than
the computer. All right.
So why didn’t we find 650 here?
Student: [Inaudible].
Instructor (Mehran Sahami): It wasn’t even there? Yeah. Tricky. Let’s fix that.
Where do we wanna stick it? Probably between 630 – they skipped whole swats of – I
just pulled this off of the Internet, which is notoriously unreliable. All right. They
probably aren’t gonna let this run all the way through the next time. But 650, as you will
notice, is now right here. So it would eventually be found. I’ll let that tell me when it’s
finished.
So now I’m going to switch over to a different kind of search, which I hope has
something to do with the way a human looks at a sorted array and finds the element that
they’re looking for quickly. But – no. Help me out before I just show you that
implementation. What is the insight that lets you find elements more quickly if there’s
some structure – if there’s a sorted order to the elements that you’re looking for?
Go ahead.
Student: [Inaudible].
Instructor (Mehran Sahami): Okay. So if I start with the first number in the array, and I
ask myself, “Is the number I’m looking for higher or lower than the first number in the
array?”
What’s the answer probably going to be?
Student: [Inaudible].
Instructor (Mehran Sahami): It’s probably higher if it’s a sorted array, right?
So I know that’s not quite what you were getting at, but I’m pushing on the definition of
this procedure so that you – would you like to add a clarification to what you were
intending?
Student: [Inaudible].
Instructor (Mehran Sahami): Yeah. The number in the middle. Okay?
So if you look at the number in the middle, and you see that that number is less than the
number you’re looking for, there’s a whole set of numbers that you now don’t have to
worry about. Right? It’s all of those numbers up to and including the number in the
middle of the array. Right?
So now you’re looking at half of the array, and in some sense, your problem is simpler. In
fact, it’s a lot simpler because if you apply the same procedure again, then you can cut
that remaining half in half, and you can cut the remaining half in half. Right?
So this kind of searching is called binary searching because you make a binary choice at
each decision point, and you can then cut the space that you’re searching in half. So what
would that look like if we wrote it out in code?
Well, if you trust me, then it would look something like this. And the insight behind these
admittedly too-short variable names is that we’re keeping track of the LH or left-hand
side of the portion of the array that we’re considering and also the right-hand side. So if
you look at these two initializations, it should make sense that you start with the left-hand
side as far left as it can be at zero and the right-hand side at the index of the very last
element in the array, which is one less than the length of the array. Right?
And so you may already be able to anticipate that we’re going to sort of move these in
closer and closer to each other until they identify an array that has only one element or,
potentially, zero elements, in which case we would decide that we were not able to find
the element that we were looking for. All right?
So we’re going to do this halving procedure until it is the case that LH and RH have
collided with each other in some sense. And the idea of getting the middle element is
captured by this next line here where we take the average of left hand and right hand.
Now searching is such an important operation, and it’s so often needed for very large data
problems that talking to people in industry who have to do a lot of searching – you may
hear someone say that there’s a bug on this line. And it’s the sort of bug that has nothing
to do with computer science in the abstract, but it’s just one of those details that is
unfortunate – the bug being that you don’t really have to add LH to RH before dividing
them. You can divide each one by two and then add their halves to get the average.
This is just sort of a side comment, but why might it be problematic for you to add them
together?
All right. If this rings no bells – why would you say?
Student: [Inaudible].
Instructor (Mehran Sahami): Integer division? That’s – I think if you work with some
small representative cases, you’ll see that that doesn’t prove to be a problem, although
that’s a good insight. But there is a maximum value – yeah?
Student: What if they’re too large?
Instructor (Mehran Sahami): What if they’re too large? So an integer can be as large as
– an [inaudible] integer can be as large as, like, four billion. But what if each one of LH
and RH is, at some point, three billion or greater? Right? And there’s going to be a
problem adding them together.
But we’ve already spent too long talking about that. I would just throw it out as sort of an
interesting detail. But after many, many years of using a binary search that looks exactly
like this, only recently have people begun to realize that little problems like that can sort
of spell disaster with really, really big problems – really, really big sets of data.
Okay. So anyway. That is somewhat an immaterial point.
We now have the index of the middle of the array – or at least the middle of the portion
of the array that we’re considering right now – and we ask our sort of array-like object
what number is at that index. And if it’s equal to our key, then we’re done. We can just
return that we found the index that we – of something that matched the key. So in the
linear search case, if there were multiple copies of the key that we were looking for,
which one would be returned?

The first one, right?
Do you have any insight yet about which one might be returned in this case?
It’s not at all clear to me. And in fact, binary searches generally don’t guarantee anything
about which element they return if there are multiple copies.
Anyway. Assuming that we don’t get lucky, though, and that we don’t find the element
that we were looking for on our first try, then we have to modify the size – the part of the
array that we’re looking at. So how do we do that?
Well, we need to determine, as was suggested, whether the element that we’re looking for
is less than the middle or greater than or equal to the middle. And in the case that it’s less
than the element of the middle of the array, then we want to adjust which side.
So we know that it must be in the first portion of the array, and we’re going to ignore the
second portion of the array. So the right-hand marker that we’re – is what we need to
update this time. So that’s what this does. It updates the right-hand marker to mid minus
one. And why is it minus one?
These sorts of issues can be really frustrating to be off by one errors in otherwise already
complicated algorithms.
Student: [Inaudible].

Search This Blog