Notes to Self: Mongo Aggregation Pipelines and Filtering
Trying to deal with Mongo's lack of NULL LAST sorting.
Photo by tian kuan on Unsplash
UPDATE There’s a follow-up to this with a better solution here
I’m going to start keeping track of little notes to self here on this blog. First, because it actually means I’m deliberately writing stuff on it, which, yay. Second, because it’s a good place to collate these things. Third, it might help someone else.
The problem
So, today’s problem: How to sort a collection of documents in Mongo by the
size of one of the documents’ ListFields
. Additionally, sort it so that
those documents which effectively have a null
value in that ListField
(more on that in a moment) are always last in the result set.
Consider the following document (using mongoengine syntax, because that’s what I’m using right now):
class BlogTag(Document):
title = StringField(required=True)
slug = StringField(required=True)
hidden = BooleanField()
class BlogPost(Document):
title = StringField(required=True)
body = StringField(required=True)
tags = ListField(ReferenceField(BlogTag))
draft = BooleanField()
Now, say we have an API endpoint which returns a list of BlogPosts
, and
allows us to order them by the number of tags. BlogPosts
where draft=True
will not show tags in this list, though they will show up. For those posts
we’d expect to see a tag_count
of None
.
So for the set:
tags = [
BlogTag.objects.create(
title=f"Tag {i}",
slug=f"tag_{i},
hidden=True if i > 7 else False
)
for _ in range(10)
]
posts = [
BlogPost.objects.create(
title="Post 0",
body="Lorem ipsum",
tags=[tags[0]] # Has 1 tag, but is a draft.
draft=True,
),
BlogPost.objects.create(
title="Post 1",
body="lorem ipsum",
tags=tags[0:1], # Has 2 tags.
),
BlogPost.objects.create(
title="Post 2",
body="lorem ipsum",
tags=tags[0:7] + [tags[5]], # Has 8 tags.
),
BlogPost.objects.create(
title="Post 3",
body="lorem ipsum",
tags=tags[0:5] + tags[8:9], # Has 8 tags, but two are hidden.
),
BlogPost.objects.create(
title="Post 4",
body="lorem ipsum",
tags=tags[0], # Has 10 tags, but is a draft.
draft=True,
),
]
… we’d expect to see the following when sorting by ascending tag_count
.
[
posts[1],
posts[3],
posts[2],
# Effectively has 'None' tags.
posts[0],
posts[4],
]
And when sorting by descending tag_count
, you’d get:
[
posts[2],
posts[3],
posts[1],
posts[0],
posts[4],
]
Doing this with mongoengine
There are two problems here:
- How do we sort the collection by
tag_count
? - How do we ensure that the nulls always appear last.
Nulls last
If you’re used to working with SQL, you might be Nelson Munsking at this
point, saying: “but ha, does Mongo not have ORDER BY <FIELD> NULL LAST
?”
Well, no, dear reader, it doesn’t, as it turns out. So we need to do a bit
more work. It means doing two separate queries and then chain()
ing them
together:
non_draft_posts = # <code we're going to write in a minute>
draft_posts = BlogPost.objects(draft=True)
ordered_posts = itertools.chain(
non_draft_posts,
draft_posts,
)
Simple! There may be a more efficient way to do it, but it would make the Mongo query gnarlier and frankly I’m not in the mood at this point.
Sorting by tag_count
Now the meat of the matter.
For this we need to use Mongo’s aggregation pipeline. An aggregation pipeline consists of a set of stages which are applied sequentially, with the result of each stage being the collection of documents upon which the next stage acts.
For this problem, we need 3 stages:
- A
match
stage, to actually find the documents that we want to sort (i.e. all the non-draft ones). - An
addFields
stage to add thetag_count
field to each document, based on the number of non-hidden tags on the post. - A s
ort
stage to sort bytag_count
in the direction we choose.
$match
The match stage is easy, and looks like this (Python rather than JS syntax because we’re still going to use mongoengine to execute this):
{"$match": {"draft": false}}
$addFields
We need to add a field to each returned document, tag_count
, which we’re
then going to use for sorting. $addFields
does this for us. But since we
also only want to include non-hidden BlogTags
in the count, we actually
need to $filter
the tags
array.
So, our $addFields
stage looks like this:
{
"$addFields": {
"tag_count": {
"$size": {
"$filter": {
"input": "$tags",
"as": "tag",
"cond": {"$not": ["$$tag.hidden"]},
}
}
}
},
}
What this translates to is:
- Add a field called
tag_count
to each document - The
tag_count
field’s value will be the size of the array returned by filtering the tags on each post so that only the non-hidden tags are counted.
Note that we can’t use $eq
here as a condition operator, because stuff gets
kind of interesting when the hidden
flag isn’t set at all. $eq
is an
equivalence operator, not an identity operator. Realistically, as long as
hidden
is falsey, we can assume that the tag is not hidden.
$sort
Similar to $match
, the sort stage is simple too:
{"$sort": {"tag_count": 1 if order == "asc" else -1}}
Putting it all together
The final code looks something like this:
pipeline = [
{"$match": {"draft": false}},
{
"$addFields": {
"tag_count": {
"$size": {
"$filter": {
"input": "$tags",
"as": "tag",
"cond": {"$not": ["$$tag.hidden"]},
}
}
}
},
},
{"$sort": {"tag_count": 1 if order == "asc" else -1}},
]
non_draft_posts = BlogPost.objects.aggregate(*pipeline)
draft_posts = BlogPost.objects(draft=True)
ordered_posts = itertools.chain(
non_draft_posts,
draft_posts,
)
In conclusion
Mongo can be really annoying when you’re trying to do something useful, and this was way more complex than it should have been. But it is possible to do this mostly in the DB with only a little bit of work done in Python, which one should always prefer for the sake of performance.
Sadly, figuring all this out took me way longer than I would’ve liked; hopeful this will provide some help for future devs.