Problem Statement
What are the best practices for optimizing aggregation pipeline performance?
Explanation
First, place $match stages as early as possible to filter documents before expensive operations. This reduces the volume of data flowing through the pipeline. MongoDB can use indexes for $match and $sort stages, but only when they run at the start of the pipeline against indexed fields, so design the pipeline and your indexes together.
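As an illustrative sketch only (the events collection, its userId and createdAt fields, and the index shown are assumptions, not part of the original solution), filtering first lets the index do the work before any computation:

// Hypothetical example: filter on indexed fields before grouping
// Assumes an index such as db.events.createIndex({ userId: 1, createdAt: 1 })
db.events.aggregate([
  { $match: { userId: 42, createdAt: { $gte: ISODate("2024-01-01") } } }, // can use the index
  { $group: { _id: "$type", count: { $sum: 1 } } }                        // runs on far fewer documents
])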
Second, limit the data processed by using $project (or $unset) to remove unnecessary fields early in the pipeline; smaller documents flow through later stages faster. Avoid $lookup when possible, as joins are expensive, and consider embedding data instead of referencing it, as sketched below.
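A hedged sketch contrasting a $lookup join with an embedded alternative (the customers collection, customerId and customerName fields, and the embedding decision are hypothetical):

// Hypothetical example: join on every query (expensive at scale)
db.orders.aggregate([
  { $lookup: { from: "customers", localField: "customerId", foreignField: "_id", as: "customer" } }
])
// If the customer name is read far more often than it changes, embedding it
// in each order document avoids the join entirely:
db.orders.find({}, { customerName: 1, amount: 1 })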
Third, place $limit immediately after $sort so MongoDB only has to keep the top k documents in memory. MongoDB coalesces this combination into a top-k sort, which is much faster than sorting the entire result set and then limiting it.
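A minimal sketch of the coalesced $sort + $limit pattern (collection and field names are assumed):

// Hypothetical example: $limit directly after $sort enables a top-k sort
db.orders.aggregate([
  { $group: { _id: "$customerId", total: { $sum: "$amount" } } },
  { $sort: { total: -1 } },   // coalesced with the following $limit
  { $limit: 10 }              // only the 10 largest totals are kept in memory
])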
Fourth, use the allowDiskUse option for large aggregations whose stages exceed the 100 MB memory limit. This lets MongoDB spill to temporary files on disk, though it is slower than in-memory processing. Fifth, analyze your pipeline with explain() to see how the stages are executed and whether indexes are being used.
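A sketch of running explain with executionStats verbosity on an aggregation, with allowDiskUse enabled (collection and field names are assumed):

// Hypothetical example: inspect the plan and allow spilling to disk
db.orders.explain("executionStats").aggregate(
  [
    { $match: { status: "completed" } },                           // should use an index
    { $group: { _id: "$customerId", total: { $sum: "$amount" } } }
  ],
  { allowDiskUse: true }  // lets memory-heavy stages write temporary files to disk
)
// In the output, check that the $match stage shows IXSCAN rather than COLLSCAN.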
Sixth, for very large or frequently repeated aggregations, consider using $merge or $out to write the results to a collection, then query that collection. This is useful for reports that do not need real-time data.
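A sketch of materializing results with $merge and querying the output collection afterward (the customerTotals collection name is hypothetical):

// Hypothetical example: precompute a report with $merge
db.orders.aggregate([
  { $match: { status: "completed" } },
  { $group: { _id: "$customerId", total: { $sum: "$amount" } } },
  { $merge: { into: "customerTotals", whenMatched: "replace", whenNotMatched: "insert" } }
])
// Later, cheap queries against the precomputed collection:
db.customerTotals.find({ total: { $gte: 1000 } })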
Code Solution
// Optimized pipeline
db.orders.aggregate([
  // 1. Filter early with indexed fields
  { $match: { status: "completed", date: { $gte: ISODate("2024-01-01") } } },
  // 2. Project only the fields later stages need
  { $project: { customerId: 1, amount: 1, _id: 0 } },
  // 3. Group and calculate per-customer totals
  { $group: { _id: "$customerId", total: { $sum: "$amount" } } },
  // 4. Sort with a limit immediately after (coalesced into a top-K sort)
  { $sort: { total: -1 } },
  { $limit: 100 }
], {
  allowDiskUse: true // For stages that exceed the 100 MB memory limit
})

// Create a compound index to support the $match stage
db.orders.createIndex({ status: 1, date: 1 })