I’ve been doing the long travel from Linz to Berlin to see #mongoberlin and it was worth it. Three tracks, a total of 27 sessions, and lots of things to learn and to discuss with others.
For those of you who are not familiar with MongoDB, it’s a document-oriented, schema-less database that comes with a very small footprint and suits perfectly to scale on the web.
Especially useful for me was the session about Indexing and Query Optimizing, and as that chapter is not so well-described in the Online-Docs of MongoDB, I want to share some of my experiences with you.
First – if you’re not familiar with indexes at MongoDB, please read the docs and use them!
I have a collection in my MongoDB called Offers, which contains about 500,000 entries. Assume an offer has a date when it was created and it has price. We now run a query on the mongo shell to get all offers where the price is between 500 and 1000 Euros. As we have so many entries, we only want to see the first 30 entries, sorted by price ascending. That query would look like that in the shell:
> db.Offers.find({price: {$gte:500, $lte:1500}}).sort({price:1}).limit(30)
Without any indexes having applied, this query takes 528ms to execute on my machine. Now let’s apply an index on the field price:
> db.Offers.ensureIndex({price:1})
‘1’ means that the index is ascending. Running the query again now returns in 1ms!
That would be a good point now to mention the explain() command. You can append that to any find()explain() command before creating the index gives us the following information:
> db.Offers.find({price: {$gte:500, $lte:1500}}).sort({price:1}).limit(30).explain()
{
"cursor" : "BasicCursor",
"nscanned" : 467560,
"nscannedObjects" : 467560,
"n" : 30,
"scanAndOrder" : true,
"millis" : 528,
"indexBounds" : {
}
}
That means that for creating the result set of that query, MongoDB had to touch 467560 entries, which is almost a full table scan, that took 528ms. Now after we apply the index on price as mentioned above, the explain() command gives us the following output:
{
"cursor" : "BtreeCursor price_1",
"nscanned" : 30,
"nscannedObjects" : 30,
"n" : 30,
"millis" : 1,
"indexBounds" : {
"price" : [
[
500,
1500
]
]
}
}
Suddenly, only 30 documents are touched (our result set), and the time drops down to 1ms.
What you can also see there, is which cursor MongoDB used to retrieve the result. In the first case it was the BasicCursor, in the second the BtreeCursor for our index price_1 (which means ascending). As a rule of a thumb, ensure that for your queries always a BtreeCursor is used.
Now we assume we do not want to sort the offers by price but by creation date, e.g. to see the newest offers on top. The query would then look like that:
> db.Offers.find({price: {$gte:500, $lte:1500}}).sort({date:1}).limit(30)
Output of the explain() command is:
{
"cursor" : "BtreeCursor price_1",
"nscanned" : 19720,
"nscannedObjects" : 19720,
"n" : 30,
"scanAndOrder" : true,
"millis" : 69,
"indexBounds" : {
"price" : [
[
500,
1500
]
]
}
}
Now we’re again on 69ms and had to touch 19720 documents. Before mongoberlin, I thought I could solve that by creating a second index on the date:
> db.Offer.ensureIndex({date:1})
After that, we get that output from explain()
{
"cursor" : "BtreeCursor date_1",
"nscanned" : 490,
"nscannedObjects" : 490,
"n" : 30,
"millis" : 3,
"indexBounds" : {
"date" : [
[
{
"$minElement" : 1
},
{
"$maxElement" : 1
}
]
]
}
}
Time dropped down to 3ms, but we still need to touch 490 documents. And if we have a look at the cursor, we see that now BtreeCursor date_1 is used. That comes to another important thing to now: MongoDB always uses maximum 1 index for query execution. In our case, the index on the price is completely ignored. This may be ok for this query, but not for a bigger dataset.
Fortunately, you can use compound indexes at mongo like that:
> db.Offer.ensureIndex({date:1, price:1})
Please note: Guys at mongoberlin told us that in such a case the sort field always has to be the last one in the compound index, but when giving that a try, it turned out to be exactly the opposite. Maybe someone can clarify that?
Anyways, after applying that index, the query runs fast again as both fields are considered in the index:
{
"cursor" : "BtreeCursor date_1_price_1",
"nscanned" : 30,
"nscannedObjects" : 30,
"n" : 30,
"millis" : 1,
"indexBounds" : {
"date" : [
[
{
"$minElement" : 1
},
{
"$maxElement" : 1
}
]
],
"price" : [
[
500,
1500
]
]
}
}
That’s it so far, I will now rework our db a bit to apply these techniques and maybe then post my experiences here again.

The {price:1, date:1} index could be used if you had an exact match on price rather than a range. Something like find({color:’red’}).sort({date:-1}) would be a good example. The issue is that the index can be used for sorting or selecting but not both, unless they are on the same field. The index can only be used for sorting if given a start point and an end point reading the index (conceptually like a sorted list) would yield documents in the correct order. In your example finding all objects with prices between 500 and 1500, they would not read be in the date order with a {price:1, date:1} index.
Things to look for in explain output:
1)$minkey/$maxkey ranges: This means that the index is not being used for queries on this field. If this is the first field in the index that means we are doing a table scan.
2) scanAndOrder: This means that the index is not being used for sorting. This is not terrible if there is a small limit and the query only matches a few objects. The runtime to find the top-k (top-30 here) objects out of n results is about n * log(k).
Hopefully that helps clear things up. Please get in touch if you have further questions.
Thanks Mathias for clarification. In the mongodb user group I posted another issue I came across when working on that example. It’s the performance of the count() operation.
Please check http://groups.google.com/group/mongodb-user/browse_thread/thread/f1a14d9096ed0c5d for that.
To put it short: The count() command on a query takes very long time if there’s a huge dataset behind, for the case that the number returned is very big (in my example it was about 400,000 of 500,000), even if there is an index on that field.
Is that reasonable or am I doing something wrong? Unfortunately I can’t view explain() on count()…