Notes to Self: Mongo Aggregation Pipelines and Filtering

Trying to deal with Mongo's lack of NULL LAST sorting.

gmb published on
5 min, 999 words

An image of several horizontal pipes, mostly parallel Photo by tian kuan on Unsplash

UPDATE There’s a follow-up to this with a better solution here

I’m going to start keeping track of little notes to self here on this blog. First, because it actually means I’m deliberately writing stuff on it, which, yay. Second, because it’s a good place to collate these things. Third, it might help someone else.

The problem

So, today’s problem: How to sort a collection of documents in Mongo by the size of one of the documents’ ListFields. Additionally, sort it so that those documents which effectively have a null value in that ListField (more on that in a moment) are always last in the result set.

Consider the following document (using mongoengine syntax, because that’s what I’m using right now):


class BlogTag(Document):

    title = StringField(required=True)
    slug = StringField(required=True)
    hidden = BooleanField()


class BlogPost(Document):

    title = StringField(required=True)
    body = StringField(required=True)
    tags = ListField(ReferenceField(BlogTag))
    draft = BooleanField()

Now, say we have an API endpoint which returns a list of BlogPosts, and allows us to order them by the number of tags. BlogPosts where draft=True will not show tags in this list, though they will show up. For those posts we’d expect to see a tag_count of None.

So for the set:


tags = [
    BlogTag.objects.create(
        title=f"Tag {i}",
        slug=f"tag_{i},
        hidden=True if i > 7 else False
    )
    for _ in range(10)
]


posts = [
    BlogPost.objects.create(
        title="Post 0",
        body="Lorem ipsum",
        tags=[tags[0]]   # Has 1 tag, but is a draft.
        draft=True,
    ),
    BlogPost.objects.create(
        title="Post 1",
        body="lorem ipsum",
        tags=tags[0:1],  # Has 2 tags.
    ),
    BlogPost.objects.create(
        title="Post 2",
        body="lorem ipsum",
        tags=tags[0:7] + [tags[5]],  # Has 8 tags.
    ),
    BlogPost.objects.create(
        title="Post 3",
        body="lorem ipsum",
        tags=tags[0:5] + tags[8:9],  # Has 8 tags, but two are hidden.
    ),
    BlogPost.objects.create(
        title="Post 4",
        body="lorem ipsum",
        tags=tags[0],  # Has 10 tags, but is a draft.
        draft=True,
    ),
]


… we’d expect to see the following when sorting by ascending tag_count.

[
    posts[1],
    posts[3],
    posts[2],

    # Effectively has 'None' tags.
    posts[0],
    posts[4],
]

And when sorting by descending tag_count, you’d get:

[
    posts[2],
    posts[3],
    posts[1],

    posts[0],
    posts[4],
]

Doing this with mongoengine

There are two problems here:

  1. How do we sort the collection by tag_count?
  2. How do we ensure that the nulls always appear last.

Nulls last

If you’re used to working with SQL, you might be Nelson Munsking at this point, saying: “but ha, does Mongo not have ORDER BY <FIELD> NULL LAST?”

Well, no, dear reader, it doesn’t, as it turns out. So we need to do a bit more work. It means doing two separate queries and then chain()ing them together:

non_draft_posts = # <code we're going to write in a minute>
draft_posts = BlogPost.objects(draft=True)
ordered_posts = itertools.chain(
    non_draft_posts,
    draft_posts,
)

Simple! There may be a more efficient way to do it, but it would make the Mongo query gnarlier and frankly I’m not in the mood at this point.

Sorting by tag_count

Now the meat of the matter.

For this we need to use Mongo’s aggregation pipeline. An aggregation pipeline consists of a set of stages which are applied sequentially, with the result of each stage being the collection of documents upon which the next stage acts.

For this problem, we need 3 stages:

  • A match stage, to actually find the documents that we want to sort (i.e. all the non-draft ones).
  • An addFields stage to add the tag_count field to each document, based on the number of non-hidden tags on the post.
  • A sort stage to sort by tag_count in the direction we choose.

$match

The match stage is easy, and looks like this (Python rather than JS syntax because we’re still going to use mongoengine to execute this):

{"$match": {"draft": false}}

$addFields

We need to add a field to each returned document, tag_count, which we’re then going to use for sorting. $addFields does this for us. But since we also only want to include non-hidden BlogTags in the count, we actually need to $filter the tags array.

So, our $addFields stage looks like this:

{
    "$addFields": {
        "tag_count": {
            "$size": {
                "$filter": {
                    "input": "$tags",
                    "as": "tag",
                    "cond": {"$not": ["$$tag.hidden"]},
                }
            }
        }
    },
}

What this translates to is:

  • Add a field called tag_count to each document
  • The tag_count field’s value will be the size of the array returned by filtering the tags on each post so that only the non-hidden tags are counted.

Note that we can’t use $eq here as a condition operator, because stuff gets kind of interesting when the hidden flag isn’t set at all. $eq is an equivalence operator, not an identity operator. Realistically, as long as hidden is falsey, we can assume that the tag is not hidden.

$sort

Similar to $match, the sort stage is simple too:

{"$sort": {"tag_count": 1 if order == "asc" else -1}}

Putting it all together

The final code looks something like this:

pipeline = [
    {"$match": {"draft": false}},
    {
        "$addFields": {
            "tag_count": {
                "$size": {
                    "$filter": {
                        "input": "$tags",
                        "as": "tag",
                        "cond": {"$not": ["$$tag.hidden"]},
                    }
                }
            }
        },
    },
    {"$sort": {"tag_count": 1 if order == "asc" else -1}},
]

non_draft_posts = BlogPost.objects.aggregate(*pipeline)
draft_posts = BlogPost.objects(draft=True)
ordered_posts = itertools.chain(
    non_draft_posts,
    draft_posts,
)

In conclusion

Mongo can be really annoying when you’re trying to do something useful, and this was way more complex than it should have been. But it is possible to do this mostly in the DB with only a little bit of work done in Python, which one should always prefer for the sake of performance.

Sadly, figuring all this out took me way longer than I would’ve liked; hopeful this will provide some help for future devs.