Welcome
-
Recent Posts
Recent Comments
- Souvik patra on রুবিকস কিউব (পর্ব ৪) : রুবিকস কিউবের সমাধান
- Al amin on গণিত অলিম্পিয়াডের প্রস্তুতি: কিছু সাধারণ প্রশ্নের জবাব (FAQ): পর্ব ১
- Rakin sadab on গণিত অলিম্পিয়াডের প্রস্তুতি: কিছু সাধারণ প্রশ্নের জবাব (FAQ): পর্ব ১
- MD. Rakin sadab on ফিজিক্স অলিম্পিয়াডের প্রস্তুতি নিয়ে কিছু কথা: ইমরোজ খান (BUET, EEE)
- Riyan on ফিজিক্স অলিম্পিয়াডের প্রস্তুতি নিয়ে কিছু কথা: ইমরোজ খান (BUET, EEE)
Archives
Categories
-
Links
Tag Archives: programming
CS 165 or: How I learned to stop worrying about speed and Love the column stores
This Spring 2014 semester, I took CS 165, a course on big data database systems by Professor Stratos Idreos. This course was offered after quite a few years; so there were quite a few of us interested. However, for myself it was a course that I have been waiting for a few years to take. As I am very interested in data science and related topics, how database works under the hood is one of the basic principles I need to know before I can do any large scale analytics efficiently. Especially the slide from the first lecture sums up the necessity of learning about DBs perfectly:
When I took the course my expectation was that this would probably be a course where we’d learn about SQL and hadoop/hive for big data. However, when I went to the office hour of Professor Idreos for the first time he said that “you don’t come to Harvard to learn about [how to code in] SQL!” So as a column storage expert, he decided to teach us about the database system in the lights of the column store paradigm.
Column stores are different than the regular SQL like row stores. The store data in columns instead of rows. So when we query data the row store has to go through the whole row, whereas in column store we can just work with the specific column(s) we are interested in. This enables massive performance benefits in many cases, including scaling of the the database. I will talk about that in details in a bit.
I was excited and terrified at the same time, because having not taken any systems course (other than CS50, which does not really count) I knew that it would be really hard to make a whole working database system from scratch in C. However, I decided to stick with it. It was not easy, but I am glad I did! I feel this is one of the classes at Harvard where I learned the most. I learned how data systems work under the hood. I learned how to maintain a codebase with about 5000 lines of code. I truly learned how to use pointers and memory management. Finally I learned about low level CPU, GPU architectures and how to leverage the CPU caches to write cache conscious code. Beside these in terms I learned a lot of lessons that would pass as the big picture-take-away from the course:
Lesson 1: The real world is messy
As cliche as it might sound, the real world is too messy. It is easy to say “oh, design a database in a distributed fashion, when you have a server and client and just make the server and client send each other streams.” However, when we implement that it is a pain, to say the least. Moreover, when we get the command/data we need to parse that. Parsing strings in C is honestly no fun…or maybe I am too spoilt by Python.
Lesson 2: Sockets are neat, but painful…aargh
We have a lot of ports in our computers which we can use to communicate between computers in a network. As we had to make things distributed (i.e. many clients in a network can talk to the server) we had to make sure that the communication is actually happening. It is slightly nontrivial because at a time one computer should send message and another should listen. However, if both of them send or both of them listen, like real life, nothing productive will happen! Also the second thing I realized was that when you program your node to send it does not necessarily send the message. For example if you send 8 bytes to server, the client might not send that immediately, rather wait for more message and buffer to optimize the communication. This is a really neat and smart thing, but that means you have no control! So that kind of sucks. To get around that I had to code up a whole streaming protocol that does it correctly and that was the bane of my Spring break!
Lesson 3: B+tree…oh the black magic
If you have done any amount of CS (given you have read so far I am assuming that you have), you must have heard of the binary trees. If you had the opportunity to implement one for the first time, you probably were not the happiest person in the world. Now take that binary tree and make that more general (i.e. each node can have arbitrarily many keys). As you can probably understand, it is pretty complicated to implement a B+tree. So it was just solid 500+ lines of uncommented C code!
On the plus side, there are great advantages of B+tree. First of all, due to the structure your leafs’ keys are always sorted and you always get a balanced tree! When I saw it the first time I thought that it is black magic! However, later I realized (doing some complexity analysis) that the cost for in-memory B+tree is the same as binary tree. However, the I/O cost is significantly lower. So if you have a disk based implementation for really huge database, then it would be strictly better to use a B+tree with high order (lot of leafs).
Lesson 4: It takes a lot of hack to make something work
It is easy to throw buzzwords like distributed system, cache conscious, threaded, parallel, and so on, but it is really hard to make it work. It takes a lot of effort to make a system work and the learning curve is steep.
Especially when I was dealing with threading (to make commands from all the clients run in parallel), I realized how messed up things can get with each thread doing their own thing in parallel. And my final version is not really thread safe, but at least I understand the concept of race conditions, thread locks etc.
One hack I was particularly proud of was lazy fetching for joins. So when we make a selection and fetch, we generally have to go through different parts of the column 1+1=2 times. However, lazy fetching does not do anything in the fetch state and waits for that to happen in the join state. So I just attached the pointer to the original column (casted void!) for that and it worked really well. So in general my join would save quite a bit of work for this lazy fetching.
Lesson 5: Making things cache conscious is hard
When we have any kind of data it moves from memory (RAM) to the L1 cache of the processor which is just next to the layer of registers in a processor. So ideally you want all your data to be in L1 cache, because it takes many times more time to get data from disk/RAM compared to the L1 cache. However, L1 cache is small (something like 32KB for a core, depending on your CPU). So the best thing you can do is that to make sure when you do any computation, you don’t need to push the memory back and forth (because the CPU would push the memory back to L1 cache if it is not actively using it). So we need to make sure when we do computation with a chunk of data we do everything possible with it. This is basically the idea of cache consciousness and it is pretty hard to do this kind of optimization. So I ended up writing code like the following for loop join!
Lesson 6: Join is fascinating!
Ii is said that the most important thing invented in computer science is hashing. Although it is arguable, I know after this course that the claim is probably true. When you do join, the most naive approach is to go through each element of the second column for each element of of the first. This is clearly O(n^2). But we can do better, and sort both and do a merge sort join, which has cost O(n log n). But we can clearly do much better with hashing. We just create a hash table for the smaller column on the fly and then probe it with the elements of the bigger column and this brings it down to O(n)– the holy grail! This works really well in most cases and the final result of hash join is no less than impressive! Maybe I will write another blog explaining the joins.
Random knowledge:
1. In C you have to truly understand what pointer and memory (i.e. the “stuff” the pointer refers to) are. Without much systems knowledge I screwed up so many times! So what I did once was when I was passing the name of columns I read from files, I did not copy them. So it was basically that I was passing the pointers to the string. So every-time the column name got rewritten and it took me a good amount of time to figure that out.
2. void * pointer is probably the best thing about C (or maybe not…). One of the reason I loved python was you can make your functions polymorphic and deal with anything. Now with the void pointer although you might have to consider cases for different datatype, you can pass different things in a single struct! Which makes life really, really easy. For example, I had two major data structs bigarr for array based columns and node for B+tree. So in my column struct I could pass either of them by casting the pointer to a void pointer! Magic!
1 2 3 4 5 6 7 8 |
typedef struct { char *name; int type; // 0 = btree, 1= bigarr, sorted, 2=bigarr, unsorted, 3 = binary table (normal), // 4 = binary table (b+tree) void * object; // pointer to array or tree structure // binary table: val array/ B+tree bigarr * sec_obj; //(optional) unsorted array if b+tree // binary table: positions, }col_data; |
3. i++ vs ++i: This is most likely a very silly thing, but as we learned about the importance of writing tight loops writing something the following is necessary.
1 2 |
for (int i = 0; i < col->size; ++i) col->arr[col->size++] = i; |
Although I don’t really know how much that helps in terms of reducing functional stack overhead, but it is cool and compact at least! So the difference between i++ and ++i is that the former returns i and makes i=i+i, while the later returns i+i and makes i=i+1. Fancy, eh?
Performance study and final remarks:
Cumulatively I probably spent 200-300 hours on this project over the last the last three months. So it would be really disappointing to see a sub-par result. However, the column storage did not disappoint me! Because we just deal with at most 4 columns at a time, the joins in column store scales unbelievably well. The following graph shows the performance when we had N columns with 100,000 rows each and we see that when N increases PostgreSQL the most advanced row based store fails so hard! Look at this sweet graph!!!
CS165 was more of an experience than a class. As I haven’t taken CS 161, I can’t comment on the comparative difficulty of these two, but hey at least I can brag about writing about 5000 lines of code and making my own database system that kicks the ass of row storage!
[I plan to keep deving the DB in future. The refactored repo can be found here: https://github.com/tmoon/TarikDB/]
Play with Python (Part 2): Preparing for the Magic
(After the previous part: Part 1: The Magic Behind Computers)
Summary: Install python, learn how to start coding, and plan to successfully finish the tutorial!
How to install Python:
I have lectured enough about the reasons for learning programming and the theories behind computers, but I know that you are not here for those! You want to learn how to make that game, right? But before you do so, you need to install python in your computer. First, you need python installer (skip if you are using ubuntu or linux). Get the python installer from this link: http://www.python.org/getit/
You can see a lot of installers here. Don’t get confused! For this tutorial we will use 32-bit python 2.7 (at the time of the writeup the latest version is 2.7.6). So download the correct installer file to your computer (see the image above).
Windows:
As usual double click on the installer and let it get installed (click ‘next’ and ‘ok’ many times!).
OS X:
Double click on the package manager and install the package.
Ubuntu/Linux:
If you are using Ubuntu you’re all set, because Ubuntu comes with python pre-installed!
Basics and Sanity Check:
So when you write code, Python has an engine (interpreter) that turns your code into instructions that computers can read (those are basically collections of 0’s and 1’s and only computers can understand them). To give commands to computers python has a program (interface) called IDLE. Now to check if the installation worked properly, open IDLE from your programs.
(Windows: Start Menu>All Programs > Python2.7>IDLE (Python GUI), in general you should get it in the list of programs in any operating system).
Now you should see the following window (minus the code text):
Now the “>>>” is the terminal where you can enter any valid python command and it will show you the output. Now type:
print "Hello world! I am learning to code!"
And you should see the output text. Congratulations! You have written your first piece of code!
So what you just did is that you outputted some text on the screen and print is the command for doing so in python. We will explain all these in a lot more details in the next chapters, but for now enjoy your new super power of being able to code!
Now try to do these math with python (We have also included the output. Of course you don’t need to type those!):
1 2 3 4 5 6 7 8 9 10 11 12 |
>>> 2 + 3 5 >>> 5 - 1 4 >>> 2 * 3 6 >>> 4 / 2 2 >>> 3 % 2 1 >>> 2 ** 3 8 |
First thing you noticed is that I have added white space in between everything. You don’t really need to do this, but this makes your code look a lot cleaner. So before and after arithmetic signs we add these spaces. Finally, what let’s figure out what these signs are doing. The first one + = addition, - = subtraction, * = multiplication, / = division. Let’s stop for a second– “but what are the next two?” you may ask. and you probably have not seen these before. Don’t worry, they are easy to understand!
The first one is “%” is modulo sign. In python a % b means you are asking for the remainder of a upon division by b. For example: 3 % 2 gave us 1 because that’s the remainder you get when you divide 3 by 2. This is a great way to check if a number is odd or even! If the remainder is 1 the number is odd, and even if it is 0!
The next one, ** means to the power. So a ** b means ab in math. And that means you are multiplying b number of a’s together. More explicitly: 2 ** 3 means 23 = 2 x 2 x 2 (i.e. multiplying 3, 2s).
Play with both the modulo and power operations on the terminal before these concepts make more sense!
Starting to Code:
Now you’re probably thinking, if you enter one line at a time and do simple addition and subtraction then probably you cannot do much. You are absolutely right! That’s why programmers write their programs in a simple text file and then make the language compiler software to run it. As we are becoming pros, we will do the same!
Select File > New Window (or press control + N). You should see a blank screen. This is just a fancy text editor (like notepad), but as you type you will see that the text contents are getting nice color coding (the color coding has a fancy name– syntax highlighting)! Now type:
1 2 3 4 5 |
# Printing print "Hello world! I am learning to code!" # Arithmetic operations print 2 + 3, 3 - 2, 2 * 3, 4 / 2, 3 % 2 |
Finally select: Run > Run Module or press F5 key on keyboard and it will probably ask you to save the file. Now give the file any name and end the name with .py (we named it first.py). It is the python file extension (like .pdf, .exe etc) and if you do this the computer will understand that it is python code! Now you can see the output printed in the terminal. Neat!
By the way, the gray texts after the hashtags (#) are called comments. When python runs the code, it goes from top to bottom, but ignores the comments. The goal of these comments is to make the code more clear to the human readers, but you can choose not to type them.
How to a complete this tutorial successfully:
Remember the last time you sat down to learn something? Yes, I do too! And I did not finish it! So it is important that you follow these instructions to learn python successfully:
- Set one hour (or maybe half an hour of time daily). This will make sure you are learning everyday.
- Go through one or two lesson(s) at a time. You might be tempted to go through everything in a single sitting, but that way you will forget everything really fast. So don’t rush and take your time.
- Type everything up yourself. I know it is really tempting to copy-paste the code, but don’t do it! Typing up will ingrain the lessons and commands to your brain as so called muscle memory. So after a while you will be able to code without looking at any tutorial or help!
- Remember the cool game we will be making at the end? Good!
- Finally don’t be disappointed if you don’t understand anything. Especially if you find anything unclear leave a comment below and I will try my best to help.
I know it was lot of materials for a single lesson. But hey! Now you know how to write a legit piece of python code! If you did not understand anything– no worries, because we will cover everything in details. Just wanted to make sure you know how to run code and where to write code to run them!
Happy coding!
Previous Part: Part 1: The Magic Behind Computers
Next Part: Part 3: The Magic Starts with Variables
All parts: Play with Python
Play with Python (Part 1): The Magic Behind Computers
(After the previous part: Part 0: Introduction)
Summary: Introduction to programming and algorithms, reasons for choosing python.
What is programming?
In simple words, programming is just talking to computers. It is similar to learning a new language in the sense that you learn how to order a computer to do something for you. However, the problem is computer itself is really dumb (no matter how smart things it can do after getting programmed!). So you have to give it painfully clear and explicit instructions to make it do anything.
Before I give a concrete example of programming, imagine you have an assistant who does everything for you. However, the only problem is that he is really dumb (still smarter than a computer). Yesterday you asked him to make simple peanut-butter sandwich for your breakfast and he made a mess! So today you are giving him instructions:
- Get whole-grain breads
- Put peanut-butter on them
- And finally put those breads together
No matter how dumb your assistant is, hopefully after this instruction he will be able to make a sandwich, but if you ask a computer (assume it is a robot with hands!) to do this job, it will still fail because all of the statements are still ambiguous. Let’s go through them again:
- Get whole-grain breads: How many? 1, 2, 100?
- Put peanut-butter on them: Both side single side? How to put peanut-butter? What to use? How much peanutbutter to use?
- And finally put those breads together: How should we put them together?
So let us write another instruction, which is very explicit and clear:
- Get 2 whole-grain breads, 1 peanutbutter jar, 1 butter knife, 1 table spoon
- Take one of the breads and put 1 table spoon on peanutbutter on it.
- Then smooth out the butter with a knife.
- Similarly do the other loaf.
- Finally put two breads so that the faces with peanutbutter on top are facing each other!
Now you understand what I mean by painfully unambiguous and clear! Watch the video from CS 50 to see how things can go wrong if you don’t have a precise algorithm.
The recipe you just made is called “algorithm” in computer programming. Before you solve any problem you need to solve it yourself and give the computer all the ingredients and the recipe (algorithm) to solve it.
Now you may ask, why do we use computer if we need to solve the problem ourselves first. Very good question! The answer is that a computer is very, very fast compared to a human being. Once you teach the computer how to solve 1 class of problems, it can solve any of those problems in future extremely fast. This great advantage of fast automation is what made computers so popular.
If you want to know a bit more about algorithms you should check this nice animated tutorial video by Dr. David Malan, the famous CS50 course instructor from the Harvard University.
Why python?
You probably thought about a big snake when you heard the name python. Here we are of course talking about the computer programming language Python. Like there are many languages in the world, there are many programming languages to talk to computers. So you may ask, why I chose python and why you should learn python instead of any other languages. In general, people use different languages for different purpose, but python is one of the very few languages that is used for almost everything. So if you learn python you can pretty much do anything!
That is probably the main reason python got so popular in the last few years.
Apart from that here is my laundry-list of reasons for which I love python and you should too:
- Python is smarter. In python you don’t need to tell the computer that you are working with a number when you can clearly see that it is a number.
- As a result, you need to write a lot less code than most other languages for doing the same work, which is great if you are lazy like me!
- As python is very popular people have written a lot of code in it and made many libraries (extra functionalities) that are freely available for your use. So basically no matter what you want to do with python almost always you will find a library. So instead of working hard and coding that functionality, you can work harder on real problem solving and do more with less code!
- Finally according to xkcd, python helps you to fly!
1 2 3 4 5 6 |
''' Happy Learning! Tarik Adnan Moon Harvard College, Class of 2015 ''' |
Previous Part: Part 0: Introduction
Next Part: Part 2: Preparing for the Magic
All parts: Play with Python
Play with Python: A Beginner’s Guide to Programming- Part 0
Why Programming?
Why learn coding when you can learn so many other things like playing a guitar, watching a movie, or just chatting with your friends on Facebook? I am not asking you to stop doing any of those, but just think for a second how cool it would be if you learned a bit of programming and could make your own games, animation, website, or even a guitar tuner that helps you to tune your guitar! If you do any of these, I can guarantee you that it would be one of the most fun things you have ever done. And that’s why I am writing this tutorial to help you learn programming.
If you look around, you will realize that everything is becoming so computer dependent now. From your cellphone to traffic lights on your way to school, everything is run using computers. Do you know who made computers do these amazing jobs? Those are smart people like you, but they know how to write programs to make computers work for us.
From mathematics, economics, physics to political science and history, people are using quantitative methods and computer programs to analyze data and find new theories based on those. So basically no matter what field you are interested in computer programs can be very useful. Plus if you look at the mainstream technology companies, you will see that they are doing really cool works to make our life easier by creating products like facebook, Instagram, or google search! And if you need inspiration from them about why you need to learn coding watch the video made by superstars!
Maybe you are thinking that when you grow up you will do business or become an artist and live in Paris (or maybe you already are a businessman or an artist). You can probably hire someone to do these. That is true, but you still need to know how computer works to even understand what you need. Plus if you can code you can make your own website or understand what you need to add to your website! If you are a business-person, you will understand better how to run your company better. Moreover, at a basic level learning coding teaches you problem solving skills, which are valuable no matter what you do in life. Watch this video by Scientist to learn how learning programming or science in general can rewire your brain! Watch the video by Neil deGrasse Tyson if you need more convincing.
Who can Learn?
You! If you were patient enough to read up to this point, you can learn programming! The approach to teach problem solving and programming python will be similar to story-telling. I will try to tell real life stories and problems that you will solve with your coding skills. And at the end of the course you will work on a final project that will need all the ideas you will learn throughout the course.
This tutorial will not have any advanced math component, so if you know how to count — I mean add, subtract (and sometime multiply and divide!) numbers, you should be fine. In terms of age group, this is most suitable for 12 – 18 years old people (mostly middle to high school students), but I don’t see why people of other age group will not be able to learn from it.
Finally, as I have mentioned the final project, what is the final project, you may ask! Wait for a surprise— you will make a full blown computer game (with nice graphics, animation, and sound) from zero knowledge of programming! And I will help you to go through this journey to be able to do so. Imagine from less than a month from now, you will be able to create your own game and show it to your friends! How cool is that!
1 2 3 4 5 6 |
''' Happy Learning! Tarik Adnan Moon Harvard College, Class of 2015 ''' |
Next Part: Part 1: The Magic Behind Computers
All parts: Play with Python