Mastering MongoDB: Top 5 Essential Aggregation Pipelines

Mastering MongoDB: Top 5 Essential Aggregation Pipelines

Unleashing the Power of Data Processing in MongoDB

Introduction

MongoDB is a leading NoSQL database with a schema-less nature, and the ability to handle large volumes of data for modern web applications.

Today MongoDB is been used in:

  • Streamline building AI-enriched applications.

  • Video Games that are global, reliable, and scalable.

  • Real-time analytical views of all the most important data.

  • Analyze and act on data from the physical world i.e. Internet of Things (IoT).

  • Store, edit, and present all types of content management.

Prerequisites

Basic knowledge of MongoDB. You can install MongoDB version 6.0 from here and can also use the MongoDB Atlas to try out these examples.

You can run the examples by using the MongoDB Playground.

What is Aggregation and its pipelines?

Aggregation:

MongoDB inherits capabilities to perform complex transformations and calculations, extracting meaningful information we need to aggregate data.

Aggregation simply refers to the process of collecting and processing data to produce a single consolidated output or result. E.g. calculating the sum, average, count, or other statistical measures from the dataset.

Aggregation Pipelines:

In MongoDB, the aggregation pipeline processes documents i.e. the data records, and returns computed results. A pipeline is made up of one or more stages, and each stage transforms the data i.e. documents as it moves through the pipeline.

The basic flow for any aggregation looks something like this as shown in the image below.

monogodb-aggregation-pipeline-diagram

This framework allows for flexible, complex, and powerful operations on data, enabling complex queries and data transformations to be performed directly in databases.

Top 5 Essential Aggregation Pipelines

The top 5 essential aggregation pipelines which every MongoDB developer should know are

  • $match

  • $group

  • $project

  • $lookup

  • $unwind

We will discuss each aggregation in detail - from prototype form i.e. syntax to examples in the following context. We will be using some sample datasets, so let's deep dive into this.

Sample Dataset

Let's imagine we are handling data for an e-commerce platform. Below is the sample dataset for the products and orders collection to understand the top 5 essential aggregation pipelines.

Sample Dataset for products collection:

// Products Collection
[
    {
        "_id": 101,
        "productName": "Laptop",
        "brand": "TechCorp",
        "specs": ["16GB RAM", "512GB SSD", "Intel i7"]
    },
    {
        "_id": 102,
        "productName": "Headphones",
        "brand": "AudioTech",
        "specs": ["Noise Cancelling", "Bluetooth", "8hr Battery"]
    },
    {
        "_id": 103,
        "productName": "Mouse",
        "brand": "TechCorp",
        "specs": ["Wireless", "Optical"]
    },
    {
        "_id": 104,
        "productName": "Keyboard",
        "brand": "TechWorks",
        "specs": ["Mechanical", "RGB Backlit"]
    }
]

Sample dataset for orders collection:

// Orders Collection
[
    {
        "_id": 1,
        "customerName": "Alice",
        "productId": 101,
        "quantity": 2,
        "price": 1000,
        "orderDate": "2023-08-01",
        "shippingCity": "New York"
    },
    {
        "_id": 2,
        "customerName": "Bob",
        "productId": 102,
        "quantity": 5,
        "price": 50,
        "orderDate": "2023-08-05",
        "shippingCity": "Los Angeles"
    },
    {
        "_id": 3,
        "customerName": "Charlie",
        "productId": 101,
        "quantity": 1,
        "price": 1000,
        "orderDate": "2023-08-05",
        "shippingCity": "New York"
    },
    {
        "_id": 4,
        "customerName": "Alice",
        "productId": 103,
        "quantity": 3,
        "price": 20,
        "orderDate": "2023-08-08",
        "shippingCity": "Boston"
    },
    {
        "_id": 5,
        "customerName": "Eve",
        "productId": 104,
        "quantity": 1,
        "price": 70,
        "orderDate": "2023-08-09",
        "shippingCity": "San Francisco"
    }
]

Please note: productId in orders collection refers to the product's _id

$match

Filters the documents to pass only the document that matches the specified condition(s) to the next pipeline stage.

The $match has the following prototype form:

{$match: { <query> } }

Now based on our sample data from orders collection, let's write a $match query to filter orders that are shipped from New York City.

Check the query below:

db.orders.aggregate([
{$match: { shippingCity: "New York" } }
)]

Based on the dataset of orders collection there will be two documents as a result. The result for the above aggregation pipeline will be as follows:

[
  {
    "_id": 1,
    "customerName": "Alice",
    "orderDate": "2023-08-01",
    "price": 1000,
    "productId": 101,
    "quantity": 2,
    "shippingCity": "New York"
  },
  {
    "_id": 3,
    "customerName": "Charlie",
    "orderDate": "2023-08-05",
    "price": 1000,
    "productId": 101,
    "quantity": 1,
    "shippingCity": "New York"
  }
]

$group

This stage separates documents into groups according to the "group key". The output is one document for each unique group key. A group key is often a field or group of fields e.g _id, productId

💡
$group does not order its output documents.

The $group stage has the following prototype form:

{
    $group: {
        _id: <expression>, // Group Key
        <field1> : { <accumulator1> : <expression1> }, // Optional
        ...
    }
}

Let's determine the total orders for each customer. Based on the orders collection this can be determined by the following query:

db.orders.aggregate([
{ 
    $group: {
        _id: "customerName",
        totalOrders: { $sum : 1 }
    }
}
])

The above query will give us the following result:

[
  {
    "_id": "Bob",
    "totalOrders": 1
  },
  {
    "_id": "Eve",
    "totalOrders": 1
  },
  {
    "_id": "Alice",
    "totalOrders": 2
  },
  {
    "_id": "Charlie",
    "totalOrders": 1
  }
]

You can see from the above results that the results are not in order and also only the _id i.e. customerName and the accumulators i.e. totalOrders are displayed as output.

$project

This aggregation is used to pass along the documents with the requested fields to the next stage in the pipeline. The specified fields can be existing fields from the input documents or newly computed fields.

The $project stage has the following prototype form:

{ $project: { <specification(s)> } }

Based on the orders collection, let's process the documents to get the total order value of each order.

In the orders collection, we have two fields price and quantity. Let's use the $multiply operator to get the total order value.

db.orders.aggregate([
{     
    $project : {
        orderValue: { $multiply: ['$quantity', 'price'] },
        customerName: 1
    } 
}
])

The output of the above query will be as follows: We can see from the results that only customerName, orderValue, together with _idfields are projected in the next pipeline stage.

[
  {
    "_id": 1,
    "customerName": "Alice",
    "orderValue": 2000
  },
  {
    "_id": 2,
    "customerName": "Bob",
    "orderValue": 250
  },
  {
    "_id": 3,
    "customerName": "Charlie",
    "orderValue": 1000
  },
  {
    "_id": 4,
    "customerName": "Alice",
    "orderValue": 60
  },
  {
    "_id": 5,
    "customerName": "Eve",
    "orderValue": 70
  }
]
💡
Suppression of the _id field can be done by using _id: 0 in the $project stage.

$lookup

One of the most important aggregations. If you have basic SQL knowledge, it performs a left outer join to a collection in the same database to filter in documents from the "joined" collection for processing.

left-outer-join

The $lookup stage adds a new array field to each input document. The new array field contains the matching documents from the "joined" collection.

The $lookup has three types of prototype forms:

  1. Equality Match with a Single Join collection: To perform an equality match between a field from the input documents with a field from the documents of the "joined" collection.

     {
        $lookup:
          {
            from: <collection to join>,
            localField: <field from the input documents>,
            foreignField: <field from the documents of the "from" collection>,
            as: <output array field>
          }
     }
    
  2. Join Conditions and Subqueries on a Joined Collection: To perform correlated and uncorrelated subqueries with two collections, and perform other join conditions besides a single equality match, use the below syntax.

     {
        $lookup:
           {
              from: <joined collection>,
              let: { <var_1>: <expression>, …, <var_n>: <expression> },
              pipeline: [ <pipeline to run on joined collection> ],
              as: <output array field>
           }
     }
    
  3. Correlated Subqueries Using Concise Syntax: The below syntax is used which removes the requirement for an equality match on the foreign local fields inside of an $expr operator:

     {
        $lookup:
           {
              from: <foreign collection>,
              localField: <field from local collection's documents>,
              foreignField: <field from foreign collection's documents>,
              let: { <var_1>: <expression>, …, <var_n>: <expression> },
              pipeline: [ <pipeline to run> ],
              as: <output array field>
           }
     }
    

Let's write a query to join products and orders using $lookup. We are using the reference id productId from the orders collection

db.orders.aggregate([
    {
        $lookup: {
            from: "products",
            localField: "productId",
            foreignField: "_id",
            as: "productDetails"
        }
    }
]);

The above query will display the result as follows:

[
  {
    "_id": 1,
    "customerName": "Alice",
    "orderDate": "2023-08-01",
    "price": 1000,
    "productDetails": [
      {
        "_id": 101,
        "brand": "TechCorp",
        "productName": "Laptop",
        "specs": [
          "16GB RAM",
          "512GB SSD",
          "Intel i7"
        ]
      }
    ],
    "productId": 101,
    "quantity": 2,
    "shippingCity": "New York"
  },
  {
    "_id": 2,
    "customerName": "Bob",
    "orderDate": "2023-08-05",
    "price": 50,
    "productDetails": [
      {
        "_id": 102,
        "brand": "AudioTech",
        "productName": "Headphones",
        "specs": [
          "Noise Cancelling",
          "Bluetooth",
          "8hr Battery"
        ]
      }
    ],
    "productId": 102,
    "quantity": 5,
    "shippingCity": "Los Angeles"
  },
  {
    "_id": 3,
    "customerName": "Charlie",
    "orderDate": "2023-08-05",
    "price": 1000,
    "productDetails": [
      {
        "_id": 101,
        "brand": "TechCorp",
        "productName": "Laptop",
        "specs": [
          "16GB RAM",
          "512GB SSD",
          "Intel i7"
        ]
      }
    ],
    "productId": 101,
    "quantity": 1,
    "shippingCity": "New York"
  },
  {
    "_id": 4,
    "customerName": "Alice",
    "orderDate": "2023-08-08",
    "price": 20,
    "productDetails": [
      {
        "_id": 103,
        "brand": "TechCorp",
        "productName": "Mouse",
        "specs": [
          "Wireless",
          "Optical"
        ]
      }
    ],
    "productId": 103,
    "quantity": 3,
    "shippingCity": "Boston"
  },
  {
    "_id": 5,
    "customerName": "Eve",
    "orderDate": "2023-08-09",
    "price": 70,
    "productDetails": [
      {
        "_id": 104,
        "brand": "TechWorks",
        "productName": "Keyboard",
        "specs": [
          "Mechanical",
          "RGB Backlit"
        ]
      }
    ],
    "productId": 104,
    "quantity": 1,
    "shippingCity": "San Francisco"
  }
]

You can see from the above response that an additional field productDetails is added which itself is an array and includes the matching documents based on the "joined" collection.

$unwind

This aggregation deconstructs an array field from the input documents to output a document for each element.

Each output document is the input document with the value of the array field replaced by the element.

The prototype form used for $unwind aggregation is as follows:

{
  $unwind:
    {
      path: <field path>,
      includeArrayIndex: <string>, // Optional
      preserveNullAndEmptyArrays: <boolean> // Optional, default value: false
    }
}

Let's use $unwind aggregation on the Product specs field of products collection: The syntax for the query will be as follows:

db.products.aggregate([
    {
        $unwind: {path:  "$specs" }
    }
]);

The response to using $unwind will be deconstructed form of specs array field:

[
  {
    "_id": 101,
    "brand": "TechCorp",
    "productName": "Laptop",
    "specs": "16GB RAM"
  },
  {
    "_id": 101,
    "brand": "TechCorp",
    "productName": "Laptop",
    "specs": "512GB SSD"
  },
  {
    "_id": 101,
    "brand": "TechCorp",
    "productName": "Laptop",
    "specs": "Intel i7"
  },
  {
    "_id": 102,
    "brand": "AudioTech",
    "productName": "Headphones",
    "specs": "Noise Cancelling"
  },
  {
    "_id": 102,
    "brand": "AudioTech",
    "productName": "Headphones",
    "specs": "Bluetooth"
  },
  {
    "_id": 102,
    "brand": "AudioTech",
    "productName": "Headphones",
    "specs": "8hr Battery"
  },
  {
    "_id": 103,
    "brand": "TechCorp",
    "productName": "Mouse",
    "specs": "Wireless"
  },
  {
    "_id": 103,
    "brand": "TechCorp",
    "productName": "Mouse",
    "specs": "Optical"
  },
  {
    "_id": 104,
    "brand": "TechWorks",
    "productName": "Keyboard",
    "specs": "Mechanical"
  },
  {
    "_id": 104,
    "brand": "TechWorks",
    "productName": "Keyboard",
    "specs": "RGB Backlit"
  }
]

Best Practices

Like any other high-performance tool, MongoDB also performs best when it is used properly. Here are some of the best practices for making your MongoDB perform best.

  1. Schema Design: Use embedded documents for data that are used more often for read operations. Use of references for data that won't be accessed every time.

  2. Indexes: Always index fields that are often required, for e.g productId in the orders collection.

  3. Aggregation: Use $match early to reduce the amount of data that subsequent stages need to process.

  4. Data Modelling: Avoid the usage of large arrays that will grow over time. For e.g Product Specs

  5. Limit the Result: Use $limit the operator to limit the result from large datasets, wherever applicable.

More Aggregation Queries to Try Out

Do you want to try out more on aggregation, below are some of the queries for you to try out at your end.

  1. Calculate Total Sales for Each Product Spec.

  2. Group by date to find the total sales for each day.

  3. Filter orders that exceed a certain price (let's assume $500).

  4. Find out the total revenue from each city.

  5. Sort customers based on the number of items they purchased.

Let me know the results in the comments section, you can also look into the solutions provided by me, check the References section, or click here.

How to run the examples

You can run the examples by using the MongoDB Playground. You can follow the instructions here to get started with MongoDB Playground.

Conclusion

MongoDB is a NoSQL document database for modern developers working on high-performance applications. With its JSON-like documents, MongoDB allows developers to model data however the application requires. With nesting arrays and embedded sub-documents, it allows the modeling of complex relationships between data.

I hope you have learned something new as I did. If so, kindly like and share the article and also follow me to read more exciting articles. You can check my social links here.

References

MongoDB Docs

Use MongoDB Playground to run the examples in your browser without installing anything.

You can find the sample datasets and also the solutions to examples in my repo https://github.com/amanpreet-dev/essential-aggregation-pipelines

Do not forget to give a star for the repo and also follow me on GitHub