Mongo and Flask Performance Tuning: quick and easy

At Vunify (which is basically the TV Guide of the future) we are a python shop using Flask and Mongo to run our site. Recently, as we were finishing up rolling out a new UI (aren’t those always fun to do?), our CEO noted that the main page was taking a while to load for logged-in users.

That page has a lot going on, both in terms of graphical elements on the screen and in its backend logic. Finding the cause of the slowdowns wouldn’t be as simple as doing a diff of the new version with the old.

Step one: profile and measure

Before attempting any kind of optimization, you must make measurements. Without a baseline measurement, you will never really know if or how much you improved the code. I decided to start by using Chrome’s developer tools to measure the time it took to download the page.

chrome dev tools screenshot of a mongo flask app

The Chrome developer console. The call to “nick.vunify.com” is the initial call to the server. See how long it takes? Is it mongo or flask that is slow?

Chrome’s dev tools confirmed that a major time sink was the initial call to the server. Once that call completed it was up to the browser to download images and javascript. Since we have more control over the performance of the server, it made sense to look there next.

The back end code consists of a series of calls to mongo to get our user’s data, some per-user logic to process that data, and then a call to the jinja template for rendering. This gist can give you a rough overview of what the code looks like:
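In rough, hypothetical form (the model, fields, and route here are invented for illustration and are not the actual Vunify code), the view looked something like this:

from flask import Flask, render_template
from mongoengine import Document, StringField, connect

app = Flask(__name__)
connect("vunify_demo")  # hypothetical database name

class Show(Document):  # stand-in for one of our user-facing models
    title = StringField()
    channel = StringField()
    user_id = StringField()

@app.route("/home/<user_id>")
def home(user_id):
    # a handful of mongo calls to pull this user's data
    shows = Show.objects(user_id=user_id)

    # some per-user logic on that data
    shows_by_channel = {}
    for show in shows:
        shows_by_channel.setdefault(show.channel, []).append(show)

    # then hand everything to the jinja template for rendering
    return render_template("home.html", shows_by_channel=shows_by_channel)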

The jinja template for this view had become quite complex in this latest release. To help with code reuse we have several jinja macros that we use throughout the page. The fear was that one of these macros was either inefficient or otherwise buggy. Flask does have a very nice debugging toolbar that can do some profiling, but it couldn’t look into the templating code to see where the hotspots were.

At this point it made sense to look at the rest of the back end code, since we have better debugging tools there. I opted to use print/logging statements in the view code to measure the time it took for each section of the code to execute. My reasoning was that while I can hook up any debugger I want on my local dev machine, I also needed to measure the performance in other environments. Simple print statements would let me quickly measure and compare performance in all of our environments.
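As a rough sketch (the helper and the names here are illustrative, not the actual view code), the timing looked something like this:

import time

def timed(label, func, *args, **kwargs):
    """Call func, print how long it took, and return its result."""
    start = time.time()
    result = func(*args, **kwargs)
    print("%s took %.3f seconds" % (label, time.time() - start))
    return result

# Inside the view, the calls then looked roughly like:
#   shows = timed("shows query", get_shows_for_user, user)
#   html = timed("render_template", render_template, "home.html", shows=shows)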

I moved the render_template() call up and assigned its result to a variable so that I could measure the time it took for the template to render. At a minimum, this would allow me to compare the time spent rendering with everything else. The timing results were very interesting.

As expected, each environment saw different numbers, but they were always in proportion to each other. This is good: it shows that the code is executing consistently across environments (e.g. it runs faster on better equipment, slower on worse equipment). The numbers looked very similar to these:

Step 2: One change at a time

The variation in the calls to mongo was not really a surprise: some of those calls pull back a lot of information, others very little. It can also vary per user, so I set up my test user with the “best case scenario” so that improvements there would (hopefully) carry through to other, more data-heavy users. It was clear that a lot of time was being spent in the render_template() function, but from what I could find online there wasn’t a lot of guidance on how to improve performance there. Most suggestions revolved around caching the compiled templates. While that is great advice, these templates seemed to already be cached, since following the instructions led to almost no change in render time.

Putting a limit() on some of the bigger mongo calls did help (a sketch of that experiment is below), but it also impacted some of the business needs: limiting the data returned limited the number of options on the screen for the user. One proposed solution was “infinite scrolling” on the page, calling back to the server for more data when needed instead of sending it all at once. While infinite scrolling seemed like a great idea, the implementation cost in terms of time (and a little bit of page redesign) would majorly impact our development timeline. In other words, we had too many other things that needed to get done first.
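For reference, the limit() experiment looked roughly like this (a minimal sketch with a made-up model, not our actual code):

from mongoengine import Document, StringField, connect

connect("vunify_demo")  # hypothetical database name

class Show(Document):  # hypothetical model
    title = StringField()
    user_id = StringField()

# Cap how many documents the bigger queries pull back.
some_shows = Show.objects(user_id="example-user").limit(50)

# mongoengine also supports slicing for the same effect:
some_shows = Show.objects(user_id="example-user")[:50]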

Step 3: Get a 2nd opinion (from a mongo master!)

As luck would have it, I was scheduled to have lunch with my friend Rick, who is very good at debugging… and also happens to be a Master of Mongo. When I mentioned the template execution time, his first reaction was that the jinja templating engine runs pretty fast, and it would be hard to write templating code so messed up that it would drag down performance. “What about the mongo data that is being sent to the template?” Rick asked. “Are you sure it’s all there?”

To this point, I had assumed it was all there. With mongo and mongoengine you can do lazy loading, and in that view I was assuming that all of the data was being loaded before being sent to the render_template() function. With Rick’s question burning in my mind, I went back and looked over our mongo calls. It turns out my assumption was not quite correct. We had a model or two that had references to other documents in mongo. When those models were being processed by the template, the mongoengine code was calling back to the mongo server to retrieve the rest of the data. Even though our mongo is on an SSD with great low-latency connectivity, this extra network activity was adding significant overhead. The mongoengine documentation suggested adding a select_related() call to queries with references like this. I selected a few of the bigger mongo calls in the view and added the call; a rough sketch of the change is below. I then re-ran the timing tests.
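Here is a hedged sketch of what that change looked like (the models are invented for illustration); the key difference is the select_related() call at the end of the query:

from mongoengine import Document, StringField, ReferenceField, connect

connect("vunify_demo")  # hypothetical database name

class Channel(Document):  # hypothetical referenced document
    name = StringField()

class Show(Document):  # hypothetical model holding a reference
    title = StringField()
    channel = ReferenceField(Channel)  # the lazily-loaded reference

# Before: each show.channel access during template rendering could trigger
# its own round trip back to mongo to dereference the Channel document.
shows = Show.objects()

# After: select_related() dereferences the referenced documents up front,
# so the template never has to call back to mongo while rendering.
shows = Show.objects().select_related()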

The functions that had the select_related() added to them did take a few milliseconds longer to execute… but the time spent in render_template() dropped by about half! The result was that overall the page was returning from the server faster than before. As I pushed this change out to each of our environments, I saw that the results held. The lazy loading was the culprit.

Conclusions

After adding the select_related() to a few more queries we saw the time it took to render the page in the browser decrease to a point where the boss was happy. There’s always room for improvement, but for now (or at least until our next big UI redesign) things are good enough.

The takeaway lessons from this are:

  • Take measurements before, during, and after.
  • The humble print statement sometimes is the best tool
  • Identify where your hotspots are
  • Ask yourself: “What could cause these slowdowns?”
  • Look beyond the initial answers for deeper underlying causes
  • Talk with someone else about the problem, a fresh perspective can do wonders.

For this situation at Vunify, the culprit was lazy loading. It wasn’t obvious at first glance, and only by explaining the problem to someone else did I get the clarity to examine the code with a fresh perspective.

Keep Flask from sending too many cookies

In the Python Flask web framework you can run into an issue where, after logging in, two cookies are sent back in the HTTP headers. One is the normal “session” cookie, and the other one is called “remember_token”.

This can cause a problem with some Android libraries that only expect 1 cookie to be sent back.

This 2nd cookie comes from the flask-login library, which has a “helpful” feature that sends it. The tricky part is that you can’t really stop it by manipulating the response object from make_response.

Instead, when you call login_user you have to pass a 2nd parameter to disable the remember cookie:
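A minimal sketch of the call, assuming “user” is whatever user object your login code already loads:

from flask_login import login_user

def log_in_without_remember(user):
    # Passing remember=False as the 2nd parameter keeps flask-login from
    # setting the "remember_token" cookie, so only the session cookie is sent.
    login_user(user, remember=False)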

 

This will prevent that “remember_token” cookie from being tacked onto your response headers. The end result is one cookie, and the Android Volley library is able to get the correct one.

Mosh, tmux, and twitter: Keeping up on the go

I’m a big fan of twitter and I wanted to share some tips to help everyone get the most out of it. Here are some of the hacks that make using twitter faster and more effective for me.

Here’s the TL;DR version for the impatient:

  • Use a text based client
  • Run the client on a remote machine (or IN THE CLOUD)
  • Use tmux to get the most out of your session
  • Use mosh to have a seamless connection
  • Use twitter lists to get a laser focus on important topics/people

(Follow me on twitter for more stuff like this! @nloadholtes)

Text is where it’s at

With only 140 chars (and occasionally some pictures), twitter is all about text. So, why not use twitter in a place where text is king: the command line!

There are several command-line twitter clients out there. I’ve been using rainbowstream for a few months, and recently started trying out krill. They each have their pluses and minuses, so I encourage everyone to try them both out and see which one appeals to you more.

Rainbowstream is nice to look at because of its adjustable themes. But… I’ve noticed it tends to “hang up” on the stream and I have to switch back to my stream to make sure it updates. Krill is relatively new to me; so far it seems really stable, but it doesn’t look as nice as rainbowstream. Krill also has the ability to follow RSS and other formats, which sounds awesome. Currently I’m running both side-by-side in a split tmux pane. More on that in a minute!

Run the client on another machine

Every time you close your laptop, you lose your network connection. Wouldn’t it be cool if you didn’t lose your place in the stream? Tweets will come in while you are away, but scrolling backwards will be time consuming.

So, run the command line clients on another machine! Cloud based machines are dirt cheap these days, and they are a snap to set up. If you have one running already, adding these clients takes up almost no resources.

Or… if you are like me and you have a RaspberryPI laying around… you can use that! I set mine up at home and configured my router to route incoming requests to it so I can easily access it no matter where I go. It’s always on (thank you battery backups!), so it keeps track of the conversations going on while I’m asleep/commuting/out-and-about. Using ssh keeps it secure and a snap to log into.

The big guy keeps an eye on it for me

And since the command line clients are so lightweight, I can run lots of stuff (like an IRC bouncer/client, a bitcoin miner, etc.) on it with no problem.

Right now you’re probably saying “But hey, you still need to log into that machine every time you open your laptop!” Well…

Use mosh to connect, and tmux to manage!

While I can use ssh to connect to the machine, if the network connection goes down I have to log in again, which can be a pain, especially on a very flaky internet connection (like at the coffee shop or at my office). That’s why you should use mosh. Mosh (the “mobile shell”) is a very cool project that logs you in over SSH and then keeps the session alive over UDP. Since it is designed to work on unreliable networks, this is the perfect solution.

When mosh (on your computer) notices the network connection goes down, it will try and contact the remote host. When the network comes back, it will automatically reconnect, and update the terminal. The end result is that you can drop in and out without really knowing it, yet you never miss anything!

I mentioned earlier that I was also running an IRC bouncer/client. How I’m doing that is with tmux, which is a “terminal multiplexer”. It basically allows one ssh/mosh connection to have multiple terminal windows. tmux is awesome, and if you get nothing else from this article you should go learn and use tmux. (If you are familiar with “screen”, tmux is a much much much better alternative. If you’ve never heard of screen… excellent.)

So, here’s how I’m using this whole setup:

  • mosh into my RaspberryPI
  • start up a tmux session
  • make one pane for twitter, another for IRC
  • Never look back. :)

From that point on, until one of the machines reboots, my computer will automagically reconnect (via mosh) and show me what’s happening on twitter or IRC. AWESOMENESS!

This is what my setup looks like:

A screenshot of my twitter tmux setup

Use the lists!

A while ago twitter rolled out a feature called lists. Lists are just ways to subscribe to users but not have them in your normal timeline. The beauty of this is you can now create lists centered on a topic, and fill it with relevant accounts.

For example, I have a list called “funny” that has some accounts that spew out jokes and other humorous quips. I also made a list called 5 which is a very focused list of people who I think are doing important and inspiring things, and these are people I should pay attention to. (The name of the list is from the Jim Rohn quote “You are the average of the 5 people you spend the most time with”.)

I then use the command line clients to watch these lists. Since the lists are focused, they are not as “active” as the normal timeline. This way I can cut back on the distractions and time suck that the normal twitter timeline can become.

On a normal day I check in on the 5 list a couple of times, and occasionally check funny, or one of my other lists. The beauty is now I’ve trained myself that it isn’t necessary to check twitter constantly because a) it won’t “update” as frequently, and b) when it does, I’m more likely to like the content.

Wrapping up

Use lists to focus on the topics that are important to you, use a command line client to keep it fast & clean, and host it somewhere other than your machine, using mosh and tmux for extra awesomeness.

Here are the links to everything:

My talk about Redacted Tweets

Last month I gave a presentation about my Redacted Tweets project at my local python users group, PyAtl.

Originally it was supposed to be a lightning talk, but I stretched it out to a full talk because one of the scheduled presenters couldn’t make it. Thankfully it is an entertaining story and everyone seemed to really enjoy my presentation.

Here are the slides that I presented. It was a lot of fun to tell everyone about this goofy little project. If you have any projects like this laying around, I would encourage you to present them at your local meetups! It is fun to share what you work on, especially if it is something funny that can make people laugh.

Hiring Hacks: Track your activity

When looking for a new job it is very easy to get overwhelmed, yet feel like you are not getting anything done. This combination can lead to a negative feedback cycle which leads to you feeling worse.

To prevent this, every day you should write down the activities that you did. And just as  importantly, you should write down the results. Let’s look at why and how this can help.

What did I do?

There are 24 hours in the day. Each hour is composed of 60 minutes. Every day we all do a countless number of things, some important, but most are mundane. (Think of things like drinking water, looking out a window, etc.)

If you take the time to write down the important things, they will stand out in your mind. The more they stand out to you, the more attention you will pay to those tasks. And if the task is important and you are paying more attention to it, you are more likely to do a “better job” with it.

But what’s important? It all depends on what your goal is. If you want to get a new job, then anything that gets you closer to that goal is important. This includes sending out resumes, talking to people, and any kind of reading or practice that you do to help you improve your skills.

Think about it: if it is important to get a job and you track “important” activities such as emailing your resume, you’ll start to notice how many times you’ve sent your resume out.

Why track these things?

It can be hard to make ourselves do something we find un-fun or unpleasant. But one thing that humans are very good at is playing games. Think of all the hours you’ve spent playing games. Things like FarmVille, twitter, and traditional video games tend to be lots of fun; that’s why we do them!

If you start tracking how many resumes you’ve sent out, then it will start to look like a game. “Yesterday I sent out 3 resumes and got 1 phone call. I wonder if I sent out 6 resumes if I’d get 2 phone calls?”

Doing important things

Once you start tracking what you did, you will notice that your activities tend to cluster together into certain categories. For me, I’ve noticed it tends to be “entertainment”, “learning”, and “job hunting”.

From those 3 things its pretty clear that one of them is going to be pretty important if I want to get a new job. The other two are important, but if I’m doing them too much, then I’m not going to be moving closer to my goal of getting a new job.

By making a list over several days you too can start to see where you are spending your time. Ask yourself this question: “Am I doing the things that are going to get me closer to my goal?” If the answer is no, then it should be pretty clear what you need to stop doing in order to make the change.

One more thing: tracking what you do helps out if you need to file for unemployment. Typically if the government is going to give you money they want to know what you are going to do to “earn” it. Being able to show details of your day shows that you are serious about getting back into the job market.

Wrapping up… for victory!

It’s easy to lose track of time and focus while job hunting. Here’s what you need to do:

  • Every time you do something important, write it down!
  • Every day, look at what you’ve accomplished
  • Every week look back at what you did
  • Decide what was important, and do more of that!
  • Decide what wasn’t important, and do less of that 😉

Using these hacks you can get your brain focused and see the progress you are making, and more importantly get to your goal faster!

Hiring Hacks

Recently I’ve found myself in the job market. In the past when faced with this situation I’ve simply thrown my resume out there and hoped that someone would see it. These days though, I tend to follow a better strategy to make sure more people get to learn about me. Here’s what I’m doing:

  1. Make sure your resume is on a public webpage somewhere.
  2. Use a link shortening service like bit.ly to create a link to your resume
    1. Optional: make the shortened url something unique like your name
  3. Post your shortened link in your various social media sites.
  4. Go to bit.ly and check the stats on your resume!

One version of a resume

In step #1 what you are doing is making a central copy of your resume. A big problem is that recruiters and hiring managers will get emailed a copy of your resume and will then put it into their system. Over time you might want to update your resume, but those old copies are still out there! By having a single copy that is publicly visible you can always point people to the “latest and greatest” version of your resume. And any changes you make get seen right away.

Personally, I have my resume in a Google Drive Document. In addition to being a full document editor like Word, they allow you to share a document as read only (or editable, but I would not recommend that for an important document like your resume!). As a bonus, if someone looks at the document while you are looking at it too, you can see where their cursor is, and that might let you know which sections are the most important to them.

The link

bit.ly is an awesome little service. It lets you turn long web addresses into short little links which are perfect for instant messages or tweets. (The shorter the link, the less likely someone is to mess it up!)

Sign up for a free account there, and then you can put in the link to your resume and create a shortened link. As a bonus, you can customize what the shortened link looks like to personalize it.

I did this and it gave me the link: http://bit.ly/1EGdxOx which points to my resume hosted on google drive. That link looks much nicer than the normal google drive link (which is https://docs.google.com/document/d/17R6XoxhaUt5ywEKinZVSDYqQaJGRLJVEnOu5LaVkmzY/)

If you don’t want to create an account on bit.ly there is still a way to check on your stats. Read on for more info!

Tell the world!

Every social media site claims that they are the hottest spot, but how do you really know? By sharing this shortened link on your various accounts at the same time you can find out exactly who is interested!

I shared my link on Twitter and LinkedIn because those are the two spots where I’m most active. Both of these sites can tell you a little bit about your posts: LinkedIn can tell you how many people have looked at your profile, and twitter can tell you how many people clicked on links.

But one thing they can’t tell you is how many people looked at the link and then shared it via email or instant message. This is where the bit.ly link comes into play.

Is anybody out there?

Once you’ve shared the link then you can visit your bit.ly account and view the stats for your link. Here’s a picture of what mine looked like:

[Screenshot of the bit.ly click stats for my resume link]

I created the link and tested it out. I then waited until Monday morning and then posted it around 7:30am. That’s the huge spike there. As you can see there were over 30 views! This is great information to know because it lets me know how active my resume is. As the week goes on I can re-send out the link and I should see a new bump. This is important to know because if I just sent the link out again on Tuesday it might not have been seen by as many people.

Further down the stats page is another important chunk of information, how the link was shared:

[Screenshot of the bit.ly referrer breakdown for the link]

This was very interesting to know. I have a similar number of followers on both LinkedIn and Twitter, but twitter drove twice as many clicks as LinkedIn! And the “Unknown” bucket was huge! “Unknown” is for referrers that didn’t identify themselves to the browser, so that could be email links, chat messages, or some other situation.

(The “Who shared a bitlink” section shows only me because I’m the only one who posted this link to bit.ly, which isn’t surprising: this type of link is going to be shared via copy-paste instead of someone retweeting it.)


BONUS HACK!

If for some reason you don’t want to make an account on bit.ly (no judgements, I totally understand!) here’s a neat little hack you can do to find out some stats:

Take any bit.ly link and put a + at the end of it. This will tell bit.ly to take you to their stats page for that link. This works for pretty much any bit.ly link and it is really interesting to see how some links are spread across the internet. Thanks to Hilary Mason for this awesome tip!


Measuring the effect

The goal of all this is of course to get a new job. Normally I would just measure my progress by the number of phone calls and interviews that I’ve had. Now I have an additional data point of knowing where and when my resume has been read.

Thanks for reading! If you’d like to talk, check out my resume or hit me up on twitter.

Installing OpenCV for Python on OS X

OpenCV is a computer vision library. It is a really powerful library and has bindings for Python. The only thing that it doesn’t have is a good explanation of how to make the python bindings work! Here is how I got it to work on my Mac (running OS X 10.9.5) inside of a virtualenv.

  1. Install cmake (brew install cmake) and the normal development tools
  2. Download and unzip the OpenCV code
  3. Change directories into the OpenCV directory
  4. Type in “cmake CMakeLists.txt”
  5. Type in “make -j5” (this will run 5 parallel build jobs and make the code build pretty fast)
  6. Type in “make opencv_python2”
  7. Type in “sudo make install” (to install all of the code)

At this point the python code has been installed to the main system python in /usr/local/lib/python2.7/site-packages which is not very helpful if you are using a virtualenv (which is what you should be using if you are working in python!)

The next step is to copy the OpenCV files from the global directory into your virtualenv. This is done by typing in the following:

cp /usr/local/lib/python2.7/site-packages/cv2.so <path_to_your_virtualenv>/lib/python2.7/site-packages/

This will copy the .so created during the build to your virtualenv, which will make it accessible to your python code. At this point you should be able to fire up the python interpreter in your virtualenv and type in:

import cv2

and it should work. Happy Computer Visioning!
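As a quick sanity check from inside the virtualenv (assuming some image file, here called “test.jpg”, sits in the current directory), you can run something like:

import cv2

print(cv2.__version__)  # should print the version you just built

img = cv2.imread("test.jpg")  # returns None if the file cannot be read
if img is None:
    print("Could not read test.jpg, try another image file")
else:
    print("Loaded an image with shape: %s" % (img.shape,))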

Making a plan B

There’s an old saying that “no plan survives first contact with the enemy”.

There is a lot of truth in that statement in a lot of situations. It would seem to say that we shouldn’t bother making plans, but I see it a different way: make plans that are flexible.

Rigidly following a course of action is rarely a good idea. A lot of people are able to recognize that something isn’t working, but the critical second step, actually making a change, is rarely taken.

Having a Plan B is usually a good idea for anything important. But if your backup plan is just as rigid as the first plan you will have the same problems. A better approach is to make sure your plans can adapt to difficulties you encounter.

For example, if you are debugging software and nothing is working, try stopping what you are doing and approach it from a wildly different angle. You will still be accomplishing your goal (debugging the software), just from a different direction.

This ability to change up your approach is the ultimate Plan B. It allows you to move forward while maintaining your momentum. You still reach the same destination, but hopefully faster. The root idea is to try to overcome your functional fixedness.

Here’s some random examples:

  • You have a flat tire. Your spare is flat too. How do you get the car/tire to the repair shop?
    • Call a tow truck?
    • Use a bicycle pump to get just enough air in the tire for you to drive it to the shop?
    • Take the tire off the car and get a friend to drive you to the shop?
  • You need to edit a large file on your computer, but you don’t have enough free disk space.
    • Delete other old files to free up space?
    • Try and hook up a USB thumb drive and do the work there?
    • See if there’s a way to do the work without copying?

There’s tons of situations in daily life that can be tackled in new ways. All it takes is the ability to remain fluid in our approaches to solving them.

Making Progress

I have a problem. Actually two problems.

One problem is that I have this feeling that I can’t shake. A feeling that I’m not doing enough, or getting enough things done. The second problem is that I have this strange hesitation to post on my blog these days.

The 2nd problem is related to twitter: I tend to post things there more frequently. Short thoughts, less friction. The first problem is a bit more of a challenge.

Most people by default tend to look at their own achievements in a less-than-flattering light. I think the main reason for this is we tend to “forget” things as we get further in time from them. For example, if you had a really great day at work, but then had 2 weeks that were really bad, you won’t see the really great day as the accomplishment that it was: instead you are overwhelmed with the most recent events (which in this case were not so great).

I’ve decided to tackle both of these problems by blogging about things I do shortly after I do them. This way I’ll kill two birds with one stone: more frequent updates, and a record of the cool and fun things I’ve done recently.

Now having said all that, here’s what I did in January:

Raspberry Pi – ZNC

Last year I got 2 raspberry pi computers. Very cool little machines, they are about the size of a credit card and are only $35. But… what should I do with them? An answer finally hit me in the form of IRC.

IRC seems to be making a comeback, a lot of interesting/cool people tend to hang out on it. So not wanting to be late to the party, I decided I would hang out there too.

The problem is I could be connecting to IRC from one of several different machines (work machines, home machines, phone, etc.). The solution is to use an “IRC bouncer” like znc to keep my connection. Znc acts as a single sign-on point: I log into it, and it keeps my connections to the IRC network alive. It also logs the conversations so I can scroll back on different machines and never miss anything.

The only catch is that znc needs to be connected to the internet all the time to maintain the connection. Not wanting to keep my power-hog home PC running all the time, the flyweight raspberry pi suddenly seemed like the ideal server. It is low power (it runs off of a micro usb connection), and runs linux. The perfect combo!

So with a little research (several other people had the same idea) and a little bit of time I was able to quickly accomplish the following:

  • Setup a Raspberry Pi so it is on the internet 24/7
  • Setup a dynamic dns program so I can get to the Pi (even if my home network gets a new IP address)
  • Setup znc and have it connect to the IRC networks of my choice
  • Set up SSL so that everything on the IRC side is encrypted

The last one is my favorite so far. With all of the security talk going on, a little more encryption is a great idea.

I’ve got a few more network-aware apps that I’m thinking about putting on the Pi, but this is a great start. And the next time I do something I’ll have it posted here! A win-win.

Running python tests randomly with randomize

Recently I was having a conversation with a co-worker about some test problems we were having. While we were brainstorming the idea of running the tests randomly came up. A quick search showed that someone else had been thinking about this issue too, but it looked like it never went anywhere.

The code that we found was here on google code but from the discussion it seems like the code was never included in nose, nor was it submitted to pypi. There was what looked like one repository on github that had this code, but it too wasn’t in the best shape.

So… I decided to grab the code from the Issue on Google Code and start up a new Github repo for it. I added several tests and fixed 1 little bug that I found and today I released it to the world.

If you are doing testing with nose in python, you can check out randomize, a nose plugin for running tests in a random order. The main reason you would want to do this is to ensure that there are no dependencies between your tests. This is very important for unit tests: they should be isolated and independent. This plugin helps confirm that isolation.

How it works: basically, as classes are read into the testing framework the plugin gets called, and it applies python’s random.shuffle() to the tests to produce a random order. One shortcoming of the plugin is that it only shuffles the tests, not the classes that hold them. (If anyone is interested in implementing this, please feel free to send me a pull request!)
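A minimal sketch of that core idea (illustrative only, not the plugin’s actual code): seed the random number generator, report the seed, and shuffle the collected test names.

import random

def shuffle_tests(test_names, seed=None):
    """Return the test names in a random but reproducible order."""
    if seed is None:
        seed = random.randint(0, 10 ** 6)
    print("Using seed: %d" % seed)  # printing the seed lets you re-run the same order
    rng = random.Random(seed)
    shuffled = list(test_names)
    rng.shuffle(shuffled)
    return shuffled

# The same seed always reproduces the same order:
print(shuffle_tests(["test_a", "test_b", "test_c"], seed=42))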

Installation is simple. On the command line just type in:

pip install randomize

and then you’ll have it installed and ready to run. To use the plugin, all you will need to do is this:

nosetests --randomize

And that will invoke it. When it runs it will print out a seed number and then begin executing the tests. If for some reason you need to re-run the tests (say to troubleshoot a test failure), all you need to do is run:

nosetests --randomize --seed=<the seed number>

and that will re-run the tests in the same order.