Indexing and Aggregation in MongoDB Tutorial
4.1 Indexing and Aggregation
Hello and welcome to Lesson 4 of the MongoDB administrator course offered by Simplilearn. This lesson will explain how to create and manage different types of indexes in MongoDB to execute queries faster. Let us explore the objectives of this lesson in the next screen.
After completing this lesson, you will be able to:
• Explain how to create unique, compound, sparse, text, and geospatial indexes in MongoDB
• Explain the process of checking the indexes used by MongoDB when retrieving documents from the database
• Identify the steps to create, remove, and modify indexes
• Explain how to manage indexes by listing, modifying, and dropping them
• Identify the different kinds of aggregation tools available in MongoDB
• Explain how to use MapReduce to perform complex aggregation operations in MongoDB
We will begin with an introduction to indexing in the next screen.
4.3 Introduction to Indexing
Indexes are data structures that store a small portion of a collection's data set in a form that is easy to traverse. MongoDB uses indexes to execute queries efficiently. Indexes help MongoDB find documents that match the query criteria without performing a collection scan. If a query has an appropriate index, MongoDB uses the index and limits the number of documents it examines. Indexes store field values ordered by value. This ordering of the index entries supports operations such as equality matches and range-based queries. MongoDB can also return sorted results by using the order of the index. Indexes in MongoDB are similar to indexes in other databases. MongoDB defines indexes at the collection level and supports indexes on any field or subfield of the documents. In the next screen, we will discuss index types.
4.4 Types of Index
MongoDB supports the following index types for querying.
Default _id (read as default underscore id): Each MongoDB collection contains an index on the _id field. If no value is specified for _id, the language driver or the mongod (read as mongo D) creates an _id field and provides an ObjectId (read as Object ID) value.
Single Field: For a single-field index and sort operation, the sort order of the index key does not matter. MongoDB can traverse the index in either ascending or descending order.
Compound Index: For multiple fields, MongoDB supports user-defined compound indexes. The order of the fields in a compound index is significant in MongoDB.
Multikey Index: To index array data, MongoDB uses multikey indexes. When indexing a field with an array value, MongoDB makes a separate index entry for each array element.
Geospatial Index: To query geospatial data, MongoDB uses two types of indexes: 2d indexes (read as two D indexes) and 2dsphere (read as two D sphere) indexes.
Text Indexes: These indexes support searching for string content in a collection.
Hashed Indexes: MongoDB supports hash-based sharding and provides hashed indexes. These indexes store hashes of the values of the indexed field.
We will discuss the index types in detail later in the lesson. In the next screen, we will discuss the index properties.
4.5 Properties of Index
Following are the index properties of MongoDB.
Unique Indexes: The unique property causes MongoDB to reject duplicate values for the indexed field. Apart from this constraint, unique indexes are functionally interchangeable with other MongoDB indexes.
Sparse Indexes: This property ensures that the index contains entries only for documents that have the indexed field. Documents without the indexed field are skipped by the index. A sparse index and a unique index can be combined to reject documents with duplicate field values while ignoring documents that lack the indexed key.
Time to Live or TTL Indexes: These are special indexes that MongoDB uses to automatically delete documents from a collection after a specified duration of time. This is ideal for information such as machine-generated data, event logs, and session data that needs to stay in the database only for a limited period.
In the next screen, we will discuss Single field Index.
4.6 Single Field Index
MongoDB supports indexes on any document field in a collection. By default, the _id field of every collection has an index. In addition, applications and users add indexes to support important queries and operations. MongoDB supports both single-field and multiple-field indexes, depending on the operations the index needs to support. The command given on the screen creates an index on the item field of the items collection. In the next screen, we will discuss how to create single field indexes on embedded documents.
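The on-screen command is not reproduced in this transcript. The following is a minimal sketch of what such a command looks like, assuming the mongo shell's db.collection.createIndex() method. The helper name createItemIndex is hypothetical; it takes the collection as a parameter only so that the snippet can also be tried outside the shell.

```javascript
// Key specification: index the `item` field in ascending order (1).
const itemIndexKeys = { item: 1 };

// Hypothetical helper: `collection` is a mongo shell collection such as db.items.
function createItemIndex(collection) {
  return collection.createIndex(itemIndexKeys);
}

// In the mongo shell: createItemIndex(db.items)
// which is equivalent to: db.items.createIndex({ item: 1 })
```

A value of -1 in the key specification would request a descending index instead.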
4.7 Single Field Index on Embedded Document
You can index top-level fields within a document. Similarly, you can create indexes on fields within embedded documents. The structure shown on the screen refers to a document stored in a collection. In the document, the details field holds an embedded document that has two fields, ISDN and publisher. To create an index on the ISDN field, or on the embedded document called “details” as a whole, perform the queries shown on the screen. In the next screen, we will discuss compound indexes.
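As a rough illustration of the two queries mentioned above, the sketch below shows a document with an embedded details field and the two key specifications. The field values are made up, and valueAtPath is a hypothetical helper that only demonstrates how dot notation addresses an embedded field.

```javascript
// A document shaped like the one described above; the values are invented.
const book = {
  _id: 1,
  item: "Book",
  details: { ISDN: "1234", publisher: "XYZ Company" },
};

// Index on one embedded field: dot notation reaches inside `details`.
const embeddedFieldKeys = { "details.ISDN": 1 };

// Index on the embedded document as a whole.
const embeddedDocumentKeys = { details: 1 };

// Illustration only: resolve a dotted path the way the key specification reads it.
function valueAtPath(doc, path) {
  return path
    .split(".")
    .reduce((value, key) => (value == null ? undefined : value[key]), doc);
}

// In the mongo shell, for some collection:
//   db.collection.createIndex(embeddedFieldKeys)
//   db.collection.createIndex(embeddedDocumentKeys)
```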
4.8 Compound Indexes
MongoDB supports compound indexes to query multiple fields. A compound index is a single index whose key specification lists several fields, separated by commas, each with its own sort direction. The command shown on the screen is an example of a compound index on two fields. The diagram depicts a compound index on the fields userid and score. The documents are first organized by userid and, within each userid, by score in descending order. The order of the fields in a compound index is crucial. For an index on the item and stock fields, the documents are first sorted by the item field value and then, within each item field value, by the stock field value. For a compound index, MongoDB limits the fields to a maximum of 31. In the next screen, we will discuss Index prefixes.
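To make the ordering of a compound index concrete, here is a small sketch that sorts sample documents the way entries would be ordered in an index on userid (ascending) and score (descending). It is plain JavaScript, not MongoDB's implementation; the sample documents and the helper compareByKeys are assumptions for illustration.

```javascript
// Compound key specification: userid ascending, then score descending.
const useridScoreKeys = { userid: 1, score: -1 };

// Illustration only: build a comparator that follows a key specification.
function compareByKeys(keys) {
  return (a, b) => {
    for (const [field, direction] of Object.entries(keys)) {
      if (a[field] < b[field]) return -direction;
      if (a[field] > b[field]) return direction;
    }
    return 0;
  };
}

const sampleDocs = [
  { userid: "ca2", score: 55 },
  { userid: "aa1", score: 30 },
  { userid: "ca2", score: 89 },
  { userid: "aa1", score: 75 },
];

// Copy before sorting so the original array is left untouched.
const indexOrder = [...sampleDocs].sort(compareByKeys(useridScoreKeys));
// indexOrder: aa1/75, aa1/30, ca2/89, ca2/55
```

In the mongo shell, the corresponding index would be requested with a call of the form db.collection.createIndex({ userid: 1, score: -1 }).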
4.9 Index Prefixes
An index prefix is a beginning subset of the fields of a compound index, always starting from the first field. For example, consider the compound index given on the screen, which covers the item, available, and soldQty (read as sold quantity) fields, in that order. Its index prefixes are item alone, and item together with available. MongoDB can use a compound index for find queries that involve only the prefix fields. The index therefore supports queries on the item field; on the item and available fields; and on the item, available, and soldQty fields. MongoDB can also use the index for a query on the item and soldQty fields, because item is a prefix, but less efficiently than an index on just those two fields. The index cannot support queries that omit the item field, such as a query on available alone or on soldQty alone. Hence, the item field should be part of any find query that is meant to use this index. We will discuss Sort Order in the next screen.
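The idea of a prefix can be sketched in a few lines. The example assumes the compound index covers item, available, and soldQty in that order; indexPrefixes and usesLeadingField are hypothetical helpers, not MongoDB functions.

```javascript
// Assumed compound key specification for this example.
const itemAvailableSoldKeys = { item: 1, available: 1, soldQty: 1 };

// Return every proper prefix of a compound key specification, shortest first.
function indexPrefixes(keys) {
  const entries = Object.entries(keys);
  return entries
    .slice(0, -1)
    .map((_, i) => Object.fromEntries(entries.slice(0, i + 1)));
}

// A query can be served through the index only if it filters on the first index field.
function usesLeadingField(keys, queryFields) {
  return queryFields.includes(Object.keys(keys)[0]);
}

// indexPrefixes(itemAvailableSoldKeys) → [{ item: 1 }, { item: 1, available: 1 }]
```

A query on available and soldQty fails the usesLeadingField check, which mirrors why such a query cannot benefit from this index.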
4.10 Sort Order
In MongoDB, you can use sort operations to control the order in which documents are returned. MongoDB can retrieve documents in sorted order directly from an index. If it cannot obtain the sort order from an index, it sorts the results in memory. Sort operations that use an index perform better than those that do not. In addition, a sort operation performed without an index is terminated when it uses more than 32 megabytes of memory. Indexes store references to fields in either ascending or descending sort order. For single-field indexes, MongoDB can traverse the index in either direction, so the sort order is not important. However, for compound indexes, the sort order is important because it determines whether the index can support a sort operation. In the next screen, we will discuss how to ensure that indexes fit in the Random Access Memory or RAM.
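The rule for compound indexes can be sketched as a simple check. This is a simplified model, assuming the sort fields must form a prefix of the index fields and that the directions must either all match the index or all be reversed (a backward traversal). It ignores the case where equality conditions on leading fields allow other sorts. The function name indexSupportsSort is hypothetical.

```javascript
// Illustration only: can an index with key pattern `keys` return documents
// already ordered as `sort` requests?
function indexSupportsSort(keys, sort) {
  const indexEntries = Object.entries(keys);
  const sortEntries = Object.entries(sort);
  if (sortEntries.length === 0 || sortEntries.length > indexEntries.length) {
    return false;
  }
  // The sort fields must appear in the same positions as the index fields.
  const sameFields = sortEntries.every(([field], i) => indexEntries[i][0] === field);
  if (!sameFields) return false;
  // Forward traversal: every direction matches. Backward: every direction is flipped.
  const allForward = sortEntries.every(([, dir], i) => dir === indexEntries[i][1]);
  const allBackward = sortEntries.every(([, dir], i) => dir === -indexEntries[i][1]);
  return allForward || allBackward;
}
```

For an index { userid: 1, score: -1 }, the sorts { userid: 1, score: -1 } and { userid: -1, score: 1 } pass the check, while { userid: 1, score: 1 } does not.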
4.11 Ensure Indexes Fit RAM
To process queries faster, ensure that your indexes fit into your system RAM. This helps the system avoid reading the indexes from the hard disk. To check the index size, use the query given on the screen. It returns the size in bytes. For the index to fit in RAM, you must have more than that amount of RAM available. In addition, you must have RAM available for the rest of the working set. If you have multiple collections, check the size of all indexes across all collections. The indexes and the working set must both fit in RAM at the same time. In the next screen, we will discuss multikey indexes.
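The on-screen query is not included in the transcript. One way to obtain these numbers in the mongo shell is db.collection.totalIndexSize(), which reports the total index size of one collection in bytes. The sketch below sums that value over every collection; totalIndexBytes is a hypothetical helper that takes the database handle as a parameter.

```javascript
// Hypothetical helper: sum totalIndexSize() (in bytes) over every collection
// of `database`, which is expected to behave like the mongo shell's `db`.
function totalIndexBytes(database) {
  return database
    .getCollectionNames()
    .reduce((sum, name) => sum + database.getCollection(name).totalIndexSize(), 0);
}

// In the mongo shell:
//   db.items.totalIndexSize()   // one collection
//   totalIndexBytes(db)         // all collections in the current database
```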
4.12 Multi-Key Indexes
When indexing a field that contains an array value, MongoDB creates a separate index entry for each array element. Queries can use these multikey indexes to select documents that contain arrays by matching on one or more elements of the arrays. You can construct multikey indexes on arrays that hold scalar values, such as strings and numbers, as well as on arrays that hold nested documents. To create a multikey index, you can use the db.collection.createIndex() (read as D-B dot collection dot create Index) method given on the screen. MongoDB automatically creates a multikey index if any indexed field contains an array. You need not specify the multikey type explicitly. In the next screen, we will discuss Compound multikey indexes.
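The "one entry per array element" behaviour can be pictured with a short simulation. This is a simplified model in plain JavaScript, not MongoDB's storage format; the tags field and the helper multikeyEntries are assumptions for illustration.

```javascript
// Illustration only: list the entries a multikey index on `field` would hold,
// one per array element, each pointing back to its document.
function multikeyEntries(docs, field) {
  const entries = [];
  for (const doc of docs) {
    const value = doc[field];
    const keys = Array.isArray(value) ? value : [value];
    for (const key of keys) {
      entries.push({ key, _id: doc._id });
    }
  }
  return entries;
}

const taggedDocs = [
  { _id: 1, tags: ["red", "blue"] },
  { _id: 2, tags: "green" },
];
// multikeyEntries(taggedDocs, "tags") yields three entries:
// red → 1, blue → 1, green → 2
```

In the mongo shell, the index itself is requested like any other, for example with db.collection.createIndex({ tags: 1 }).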
4.13 Compound Multi-Key Indexes
In a compound multikey index, each indexed document can have at most one indexed field whose value is an array. If more than one indexed field of a document has an array value, you cannot create a compound multikey index on those fields. An example of a document structure is shown on the screen. In this collection, both the product_id (read as product underscore ID) and retail_id (read as retail underscore ID) fields are arrays. Therefore, you cannot create a compound multikey index on them. Note that a shard key index and a hashed index cannot be a multikey index. In the next screen, we will discuss hashed indexes in detail.
4.14 Hashed Indexes
The hashing function collapses embedded documents and computes the hash for the entire value. However, it does not support multikey indexes, that is, indexes on arrays. Hashed indexes support sharding: a collection can be sharded with a hashed shard key, which ensures an even distribution of data. MongoDB uses hashed indexes to support equality queries; however, range queries are not supported. You cannot create a unique or compound index that includes a hashed field. However, you can create both a hashed and a non-hashed index on the same field. MongoDB uses the non-hashed (scalar) index for range queries. You can create a hashed index using the operation given on the screen. This will create a hashed index for the items collection on the item field. In the next screen, we will discuss TTL indexes in detail.
4.15 TTL Indexes
TTL indexes automatically delete documents, such as machine-generated data, after a specified amount of time. You can create a TTL index by combining the db.collection.createIndex() (read as D-B dot collection dot create index) method with the expireAfterSeconds (read as expire after seconds) option on a field whose value is either a date or an array that contains date values. For example, to create a TTL index on the lastModifiedDate (read as last modified date) field of the eventlog collection, use the operation shown on the screen in the mongo shell. The TTL background thread runs on both primary and secondary nodes. However, it deletes documents only on the primary node.
TTL indexes have the following limitations.
• Compound indexes do not support TTL and ignore the expireAfterSeconds option.
• The _id field does not support TTL indexes.
• TTL indexes cannot be created on a capped collection because MongoDB cannot delete documents from a capped collection.
• You cannot use the createIndex() (read as create index) method to change the value of expireAfterSeconds of an existing index.
You cannot create a TTL index on a field if a non-TTL index already exists on the same field. If you want to change a non-TTL single-field index to a TTL index, first drop the index and then recreate it with the expireAfterSeconds option. In the next screen, we will discuss creating unique indexes.
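The on-screen operation is not part of the transcript. The sketch below shows the key and option documents such a TTL index would use, together with a simplified model of the expiry rule. The one-hour lifetime is an assumed value, and isExpired is a hypothetical helper. For array values it uses the earliest date, which is how MongoDB is documented to behave.

```javascript
// Key and option documents for a TTL index on lastModifiedDate.
// The lifetime of 3600 seconds is an assumption made for this example.
const ttlKeys = { lastModifiedDate: 1 };
const ttlOptions = { expireAfterSeconds: 3600 };
// In the mongo shell: db.eventlog.createIndex(ttlKeys, ttlOptions)

// Illustration only: would `doc` count as expired at time `now`?
function isExpired(doc, field, expireAfterSeconds, now) {
  const value = doc[field];
  const dates = (Array.isArray(value) ? value : [value]).filter(
    (d) => d instanceof Date
  );
  if (dates.length === 0) return false; // no date value: the document never expires
  const earliest = Math.min(...dates.map((d) => d.getTime()));
  return earliest + expireAfterSeconds * 1000 <= now.getTime();
}
```

Deletion is performed by a background task, so in practice an expired document may remain in the collection for a short while after it becomes eligible.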
4.16 Unique Indexes
To create a unique index, use the db.collection.createIndex() method and set the unique option to true. For example, to create a unique index on the item field of the items collection, execute the operation shown on the screen in the mongo shell. By default, unique is false on MongoDB indexes. If you use the unique constraint on a compound index, MongoDB enforces uniqueness on the combination of the fields in the compound key, not on each field individually. Unique Index and Missing Field: If a document has no value for the indexed field of a unique index, the index stores a null value for that document. Because of the unique constraint, MongoDB permits only one document without the indexed field. If more than one document has a missing or valueless indexed field, the index build fails with a duplicate key error. To filter out these null values and avoid the error, combine the unique constraint with the sparse option. In the next screen, we will discuss sparse indexes.
4.17 Sparse Indexes
Sparse indexes contain entries only for documents that have the indexed field, even if that field contains a null value. A sparse index skips documents that do not contain the indexed field. Non-sparse indexes do not skip these documents and store null values for them. To create a sparse index, use the db.collection.createIndex() method and set the sparse option to true. In the example given on the screen, the operation in the mongo shell creates a sparse index on the item field of the items collection. If using a sparse index would produce an incomplete result set for a query or sort operation, MongoDB does not use that index unless it is explicitly specified with the hint() method. For example, the second command given on the screen will not use a sparse index on the x field unless it receives an explicit hint. An index that is both sparse and unique prevents a collection from containing documents with duplicate values for the indexed field. However, it allows multiple documents that omit the key. In the next screen, we will view a demo on how to create compound, sparse, and unique indexes.
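The interaction between the unique and sparse options can be sketched as follows. The option documents are the ones createIndex() accepts; acceptsAll is a hypothetical helper that models, in plain JavaScript, whether a unique index on one field would accept a set of documents.

```javascript
// Option documents as passed to createIndex(keys, options).
const uniqueOptions = { unique: true };
const sparseUniqueOptions = { unique: true, sparse: true };
// In the mongo shell: db.items.createIndex({ item: 1 }, sparseUniqueOptions)

// Illustration only: would a unique index on `field` accept all of `docs`?
function acceptsAll(docs, field, options) {
  const seen = new Set();
  for (const doc of docs) {
    const missing = !(field in doc);
    if (missing && options.sparse) continue; // sparse: no entry for this document
    const key = missing ? null : doc[field]; // non-sparse: missing field is indexed as null
    if (seen.has(key)) return false; // this would raise a duplicate key error
    seen.add(key);
  }
  return true;
}

const itemsWithGaps = [{ _id: 1, item: "pen" }, { _id: 2 }, { _id: 3 }];
// acceptsAll(itemsWithGaps, "item", uniqueOptions)       → false (two null entries)
// acceptsAll(itemsWithGaps, "item", sparseUniqueOptions) → true
```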
4.18 Demo—Create Compound, Sparse, and Unique Indexes
This demo will show the steps to create compound, sparse, and unique indexes in MongoDB. Click the demo icon to view the demo.
4.20 Text Indexes
Text indexes in MongoDB help search for text strings in the documents of a collection. You can create a text index on one or more fields that contain string values or arrays of strings. To use a text index, issue a query with the $text (read as text) query operator. When you create a text index on multiple fields, specify the individual fields or use the wildcard specifier ($**) (read as dollar star star). To create a text index on the subject and content fields, perform the query given on the screen. The text index covers all strings in the subject and content fields, where the field value is either a string or an array of string elements. To allow text search on all fields with strings, use the wildcard specifier ($**) (read as dollar star star). This indexes all fields containing string content. The second example given on the screen indexes every string value in every field of every document in a collection and names the index TextIndex. In the next screen, we will view a demo on how to create single field and text indexes in MongoDB.
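Since the on-screen queries are not included here, the following sketch shows the documents such calls would plausibly use. The search string is a made-up example. Note that a collection can have only one text index, so the two createIndex() calls are alternatives rather than steps to run together.

```javascript
// Text index on two named fields.
const subjectContentKeys = { subject: "text", content: "text" };

// Wildcard text index on every string field, with a custom index name.
const wildcardKeys = { "$**": "text" };
const wildcardOptions = { name: "TextIndex" };

// Filter document for a $text query; the search string is invented.
const searchFilter = { $text: { $search: "coffee" } };

// In the mongo shell (choose one of the two indexes per collection):
//   db.collection.createIndex(subjectContentKeys)
//   db.collection.createIndex(wildcardKeys, wildcardOptions)
//   db.collection.find(searchFilter)
```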
4.21 Demo—Create Single Field and Text Index
This demo will show the steps to create single field and text indexes in MongoDB. Click the demo icon to view the demo.
4.23 Text Search
MongoDB supports various languages for text search. Text indexes drop language-specific stop words, such as “the”, “an”, “a”, and “and”, and use simple language-specific suffix stemming. You can also choose to specify a language for text search. If you specify the language value as "none", the text index uses simple tokenization without any stop words or stemming. In the query given on the screen, you are enabling text search on the item field of the customer_info (read as customer underscore info) collection with Spanish as the default language. If the index language is English, text indexes are case-insensitive for the letters A to Z.
The text index and the $text operator support the following:
• Two-letter language codes defined in ISO 639-1 (read as I-S-O 6-3-9-1).
• Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, and Turkish.
Note that a compound text index cannot include special index types, such as multikey or geospatial index fields. In the next screen, we will discuss index creation.
4.24 Index Creation
MongoDB provides several options to create indexes. By default, creating an index blocks all other operations on the database. For example, while an index is being built on a collection, the database that holds the collection is unavailable for read or write operations until the index build completes. Read and write operations on the database queue up and wait for the index build to finish. Therefore, for index builds that may take a long time, consider building the index in the background, which keeps the database available during the entire operation. The command given on the screen is used for this purpose. By default, background is false for building MongoDB indexes. We will discuss index creation further in the next screen.
4.25 Index Creation (contd.)
When MongoDB is creating indexes in the background for a collection, you cannot perform other administrative operations involving that collection. For example, you cannot run repairDatabase (read as repair database), drop the collection by using db.collection.drop() (read as D-B dot collection dot drop), or run compact (read as compact). If you attempt any of these operations, you will receive an error. The background index build uses an incremental approach and is slower than the normal “foreground” index build. The speed of the index build depends on the size of the index. If the index is larger than the available RAM, the background build can take much longer than a foreground build. Building indexes can severely affect your database performance if the application includes createIndex() (read as create index) operations and the required index does not yet exist when it is needed for operations. To avoid performance issues, you can use the getIndexes() (read as get indexes) method to make your application check for the indexes at startup. You can also use the equivalent method of your driver and have the application terminate if the proper indexes do not exist. Build indexes with separate application code, during designated maintenance windows. We will discuss how to create indexes on replica sets in the next screen.
4.26 Index Creation on Replica Set
Typically, background index operations on the secondary members of a replica set begin after the index build completes on the primary. If the index build runs in the background on the primary, it also runs in the background on the secondary nodes. If you want to build large indexes on secondaries, you can build the index by restarting one secondary at a time in standalone mode. After the index build is complete, restart the node as a member of the replica set, allow it to catch up with the other members of the set, and then build the index on the next secondary. When all the secondaries have the new index, step down the primary, restart it as a standalone, and build the index on the former primary. To ensure that a secondary can catch up with the primary, the time taken to build the index on the secondary must fit within the window of the oplog. To let them catch up with the primary as soon as possible, secondary nodes in the “recovering” mode always build indexes in the foreground. Instead of using the default name, you can specify a name for the index by using the command given on the screen. This will create an index on the item field, named item_index (read as item underscore index), for the customer_info (read as customer underscore info) collection. In the next screen, we will discuss how to remove indexes.
4.27 Remove Indexes
You can use the following methods to remove indexes. db.collection.dropIndex() (read as D-B dot collection dot drop index) method: This removes a single, specified index from a collection. db.collection.dropIndexes() (read as D-B dot collection dot drop indexes) method: This removes all indexes from a collection except the index on the _id field. For example, the first operation given on the screen removes an ascending index on the item field in the items collection. To remove all indexes barring the _id index from a collection, use the second operation provided on the screen. In the next screen, we will discuss how to modify an index.
4.28 Modify Indexes
To modify an index, first drop the index and then recreate it. Perform the following steps to modify an index.
4.29 Demo—Drop an Index from a Collection
Execute the first query given on the screen to return a document showing the operation status. If the operation is successful, the ok field in the returned document will display numeric 1.
4.30 Demo—Drop an Index from a Collection
Execute the second query given on the screen to return a document showing the status of the results. For example, if the operation is successful, the returned document shows numIndexesAfter (read as number indexes after) as greater than numIndexesBefore (read as number indexes Before) by one. In the next screen, we will view a demo on how to drop an index. This demo will show you the steps to drop indexes for a collection. Click the demo icon to view the demo.
4.31 Rebuild Indexes
In addition to modifying indexes, you can also rebuild them. To rebuild all indexes of a collection, use the db.collection.reIndex() (read as D-B dot collection dot re index) method. This will drop all indexes, including the _id index, and rebuild them in a single operation. The operation takes the form db.items.reIndex(). To view the status of an indexing process, type the db.currentOp() (read as D B dot Current operation) command in the mongo shell. The message field will show the percentage of the build that is complete. To abort an ongoing index build, use the db.killOp() (read as D B dot kill operation) method in the mongo shell. For index builds, the effect of db.killOp() may not be immediate and may occur only after most of the index build has completed. Note that a replicated index build on the secondary members of a replica set cannot be aborted. In the next screen, we will discuss Listing Indexes.
4.32 Listing Indexes
You can list all indexes of a collection and of a database. You can get a list of all indexes of a collection by using the db.collection.getIndexes() method or a similar method for your driver. For example, to view all indexes on the items collection, use the db.items.getIndexes() method. To list all indexes of all collections in a database, you can use the operation in the mongo shell as shown on the screen. In the next screen, we will view a demo on retrieving indexes for a collection and database.
4.33 Demo—Retrieve Indexes for a Collection and Database
This demo will show you the steps to retrieve indexes for a collection and for a whole database. Click the demo icon to view the demo.
4.35 Measure Index Use
Query performance is usually a good indicator of how well indexes are being used. MongoDB provides a number of tools to study query operations and observe index use for your database. The explain() method can be used to print information about query execution. It returns a document that describes the process and the indexes used to return the query results. This helps optimize a query. Using the db.collection.explain() or the cursor.explain() method helps measure index usage. In the next screen, we will view a demo on using mongo shell methods to monitor index usage.
4.36 Demo—Use Mongo Shell Methods to Monitor Indexes
This demo will show you the steps to use different mongo shell methods to monitor the usage of indexes. Click the demo icon to view the demo.
4.38 Control Index Use
If you want to force MongoDB to use a particular index for querying documents, specify the index with the hint() method. The hint method can be appended to the find() method. Consider the example given on the screen. This command queries for documents whose item field value is “Book” and whose available field value is “true”. Here, MongoDB’s query planner is directed to use the index created on the item field. To view the execution statistics for a specific index, append the explain method to the find command. For example, consider the queries given on the screen. If you want to prevent MongoDB from using any index, pass the $natural (read as natural) operator to the hint() method. For example, use the query given on the screen. In the next screen, we will view a demo on using operators, explain, hint, and natural to create an index.
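The three on-screen commands are not part of the transcript. A sketch of how they could be written follows, using the mongo shell's find(), hint(), and explain() methods. The helper names are hypothetical, the filter uses the boolean true for the available field (an assumption), and each helper takes the collection as a parameter so that it can be exercised outside the shell.

```javascript
// Filter from the example: item is "Book" and available is true.
const bookFilter = { item: "Book", available: true };

// Direct the query planner to the index on `item`.
function findWithItemIndex(collection) {
  return collection.find(bookFilter).hint({ item: 1 });
}

// Same query, returning execution statistics instead of documents.
function explainWithItemIndex(collection) {
  return collection.find(bookFilter).hint({ item: 1 }).explain("executionStats");
}

// Bypass every index and scan the collection in natural order.
function findWithoutIndex(collection) {
  return collection.find(bookFilter).hint({ $natural: 1 });
}

// In the mongo shell: findWithItemIndex(db.items), and so on.
```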
4.39 Demo—Use the Explain, $Hint and $Natural Operators to Create Index
This demo will show you the steps to use the operators, explain, hint, and natural to create an index. Click the demo icon to view the demo.
4.41 Index Use Reporting
MongoDB provides different metrics to report index use and operation. You can consider these metrics when analyzing index use for your database. These metrics are printed using the following commands.
serverStatus prints two metrics, scanned and scanAndOrder (read as scan and order):
• scanned displays the number of documents that MongoDB scans in the index to carry out the operation. If the number of scanned documents is much higher than the number of returned documents, the database has scanned many objects to find the target objects. In such cases, consider creating an index to improve this.
• scanAndOrder is a boolean that is true when a query cannot use the order of documents in the index for returning sorted results. In that case, MongoDB must sort the documents after it receives them from a cursor. If scanAndOrder is false, MongoDB can use the order of the documents in an index to return the sorted results.
collStats (read as collection statistics) prints two metrics:
• totalIndexSize, which returns the total index size in bytes.
• indexSizes, which reports the size of the data allocated for each index.
dbStats (read as database statistics) has the following two metrics:
• dbStats.indexes, which contains a count of the total number of indexes across all collections in the database.
• dbStats.indexSize, which gives the total size in bytes of all indexes created on this database.
In the next screen, we will discuss geospatial index.
4.42 Geospatial Index
With the increased usage of handheld devices, geospatial queries are becoming increasingly frequent for finding the nearest data points for a given location. MongoDB provides geospatial indexes to support such queries. Suppose you want to find the nearest coffee shop from your current location. You need to create a special index to perform such queries efficiently because they need to search in two dimensions, longitude and latitude. A geospatial index is created using the createIndex function, passing "2d" (read as two D) or "2dsphere" as the value instead of 1 (read as one) or -1 (read as minus one). To query geospatial data, you first need to create a geospatial index. In the index specification document for the db.collection.createIndex() method as shown on the screen, specify the location field as the index key and specify the string literal "2dsphere" as the value. A compound index can include a 2dsphere index key in combination with non-geospatial index keys. In the next screen, we will view a demo on how to create geospatial indexes.
4.43 Demo—Create Geospatial Index
This demo will show you the steps to create geospatial indexes in MongoDB. Click the demo icon to view the demo.
4.45 MongoDB’s Geospatial Query Operators
The geospatial query operators in MongoDB let you perform the following queries.
Inclusion queries return the locations included within a specified polygon. They use the $geoWithin (read as geo within) operator. Both 2d and 2dsphere indexes support this query. MongoDB does not require an index to perform an inclusion query, but an index can improve the query performance.
Intersection queries return locations that intersect with a specified geometry. These queries use the $geoIntersects (read as geo intersects) operator and apply only to data on a spherical surface. Only 2dsphere indexes support intersection.
Proximity queries return the points closest to a specified point. They use the $near (read as near) operator, which requires a 2d or 2dsphere index.
In the next screen, we will view a demo on using geospatial indexes in a query.
4.46 Demo—Use Geospatial Index in a Query
This demo will show you the steps to use geospatial indexes in the find query of MongoDB. Click the demo icon to view the demo.
$geoWithin Operator
The $geoWithin (read as geo within) operator is used to query location data found within a GeoJSON (read as geo J-SON) polygon. For the query to return results, the location data needs to be stored in the GeoJSON format. You can use the syntax given on the screen to use the $geoWithin operator. The example given on the screen selects all points and shapes that exist entirely within a GeoJSON polygon. We will discuss proximity queries in MongoDB in the next screen.
4.49 Proximity Queries in MongoDB
Proximity queries return the points closest to the specified point. These queries sort the results by their proximity to the specified point. You need to create a 2dsphere index in order to perform a proximity query on GeoJSON data points. To query the data, you can use either the $near or the $geoNear (read as geo near) operator. The first syntax given on the screen is an example of the $near operator. The $geoNear command uses the second syntax given on the screen. This command offers additional options and returns more information than the $near operator. In the next screen, we will discuss aggregation.
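As a rough sketch of the $near syntax mentioned above, the example below builds the index key specification and a $near filter around a GeoJSON point. The places collection, the coordinates, and the helper nearFilter are assumptions made for illustration.

```javascript
// Key specification for a 2dsphere index on the `location` field.
const locationKeys = { location: "2dsphere" };

// Build a $near filter around a GeoJSON point.
// GeoJSON lists longitude first, then latitude; distances are in meters.
function nearFilter(longitude, latitude, maxDistanceMeters) {
  return {
    location: {
      $near: {
        $geometry: { type: "Point", coordinates: [longitude, latitude] },
        $maxDistance: maxDistanceMeters,
      },
    },
  };
}

// In the mongo shell, with a hypothetical `places` collection:
//   db.places.createIndex(locationKeys)
//   db.places.find(nearFilter(-73.97, 40.77, 500))
```

The results of such a query come back ordered from nearest to farthest, so no separate sort stage is needed.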
4.50 Aggregation
Operations that process data sets and return calculated results are called aggregations. MongoDB provides data aggregations that examine data sets and perform calculations on them. Aggregation runs on the mongod instance, which simplifies application code and limits resource requirements. Similar to queries, aggregation operations in MongoDB use collections of documents as input and return results in the form of one or more documents. The aggregation framework in MongoDB is based on data processing pipelines. Documents pass through a multi-stage pipeline and get transformed into an aggregated result. The most basic pipeline stages provide filters that function like queries and document transformations that modify the output document. Other pipeline operations group and sort documents by a defined field or fields. In addition, they perform aggregation on arrays. Pipeline stages can use operators to perform tasks such as calculating an average or concatenating strings. The pipeline uses native operations within MongoDB to allow efficient data aggregation and is the preferred method for data aggregation. In the next screen, we will continue our discussion on data aggregation.
4.51 Aggregation (contd.)
With the help of the aggregate function, you can perform complex aggregation operations, such as finding the total transaction amount for each customer. In the diagram shown on the screen, “orders” is the collection, and its documents have three fields: cust_id, amount, and status. In the $match (read as match) stage, you select only those documents whose status field value is “A”. In the $group (read as group) stage, you sum the “amount” field for each cust_id. In the next screen, we will discuss pipeline operators and indexes.
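The two stages described above can be written as a pipeline array, sketched below together with a plain JavaScript equivalent that shows what each stage does. The output field name total and the sample orders are assumptions; the diagram itself is not part of the transcript.

```javascript
// The pipeline: keep documents with status "A", then total `amount` per cust_id.
const ordersPipeline = [
  { $match: { status: "A" } },
  { $group: { _id: "$cust_id", total: { $sum: "$amount" } } },
];
// In the mongo shell: db.orders.aggregate(ordersPipeline)

// Plain JavaScript equivalent of the two stages, for illustration only.
function totalsByCustomer(orders) {
  const totals = new Map();
  for (const order of orders) {
    if (order.status !== "A") continue; // $match
    const running = totals.get(order.cust_id) || 0;
    totals.set(order.cust_id, running + order.amount); // $group with $sum
  }
  return [...totals].map(([_id, total]) => ({ _id, total }));
}

const sampleOrders = [
  { cust_id: "A123", amount: 500, status: "A" },
  { cust_id: "A123", amount: 250, status: "A" },
  { cust_id: "B212", amount: 200, status: "A" },
  { cust_id: "A123", amount: 300, status: "D" },
];
// totalsByCustomer(sampleOrders)
// → [{ _id: "A123", total: 750 }, { _id: "B212", total: 200 }]
```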
4.52 Pipeline Operators and Indexes
The aggregate command in MongoDB operates on a single collection and logically passes the entire collection through the aggregation pipeline. You can optimize the operation and avoid scanning the entire collection by using the $match, $limit, and $skip (read as match, limit, and skip) stages. You may require only a subset of the data in a collection to perform an aggregation operation. In that case, use the $match, $limit, and $skip stages to restrict the documents that enter the pipeline. When placed at the beginning of a pipeline, the $match operation can use suitable indexes to scan only the matching documents in a collection. Placing a $match stage followed by a $sort stage at the start of the pipeline is logically equivalent to a single query with a sort, and it can use an index. Therefore, it is recommended to place $match at the beginning of the pipeline whenever possible. In the next screen, we will discuss aggregate pipeline stages.
4.53 Aggregate Pipeline Stages
Typically, pipeline stages appear in an array. Documents pass through the pipeline stages in sequence, one after the other. Barring $out and $geoNear, all stages of the pipeline can appear multiple times. The db.collection.aggregate() (read as DB dot collection dot aggregate) method provides access to the aggregation pipeline and returns a cursor, so it can return result sets of any size. The various pipeline stages are as follows. $project (read as project): This stage adds new fields or removes existing fields and thus restructures each document in the stream. This stage returns one output document for each input document provided. $match (read as match): This stage filters the document stream and allows only matching documents to pass into the next stage without any modification. $match uses the standard MongoDB queries. For each input document, it returns either one output document if there is a match or zero documents when there is no match. $group (read as group): This stage groups documents based on the specified identifier expression and applies logic known as accumulator expressions to compute one output document for each group. $sort (read as sort): This stage rearranges the order of the document stream using the specified sort keys. The documents remain unaltered even though the order changes. This stage provides one output document for each input document. We will continue our discussion on aggregate pipeline stages in the next screen.
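The four stages above can be imitated in memory to see how many documents each one emits. This is a rough sketch, not MongoDB's implementation: the sample documents, the field names, and the helpers project, match, group, and sortDesc are all hypothetical.

```javascript
// Hypothetical input documents.
const docs = [
  { name: "ann", dept: "ops", salary: 50, age: 31 },
  { name: "bob", dept: "dev", salary: 70, age: 45 },
  { name: "cid", dept: "dev", salary: 90, age: 28 },
  { name: "dee", dept: "ops", salary: 30, age: 52 },
];

// Comparable mongo shell pipeline, shown for reference only.
const pipeline = [
  { $project: { dept: 1, salary: 1 } },
  { $match: { salary: { $gte: 50 } } },
  { $group: { _id: "$dept", total: { $sum: "$salary" } } },
  { $sort: { total: -1 } },
];

// $project: one output per input, keeping only the named fields.
const project = (input, fields) =>
  input.map((doc) => Object.fromEntries(fields.map((f) => [f, doc[f]])));

// $match: each input yields one unmodified document or none.
const match = (input, predicate) => input.filter(predicate);

// $group: one output per distinct key, with a $sum-style accumulator.
function group(input, keyField, sumField) {
  const totals = new Map();
  for (const doc of input) {
    totals.set(doc[keyField], (totals.get(doc[keyField]) || 0) + doc[sumField]);
  }
  return [...totals].map(([key, total]) => ({ _id: key, total }));
}

// $sort: same documents, new order (descending by the given field).
const sortDesc = (input, field) =>
  [...input].sort((a, b) => b[field] - a[field]);

const projected = project(docs, ["dept", "salary"]);
const matched = match(projected, (doc) => doc.salary >= 50);
const grouped = group(matched, "dept", "salary");
const sorted = sortDesc(grouped, "total");
console.log(pipeline.length, sorted);
```

Here $project and $sort keep the document count unchanged, $match drops one document, and $group collapses the stream to one document per department.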
4.54 Aggregate Pipeline Stages (contd.)
Some more pipeline stages include the following. $skip (read as skip): This stage skips the first n documents, where n is the specified skip number. It passes the remaining documents without any modifications to the pipeline. For each input document, it returns either zero documents for the first n documents or one document. $limit (read as limit): This stage passes the first n documents without any modifications to the pipeline. For each input document, this stage returns either one document for the first n documents or zero documents after the first n documents. $unwind (read as unwind): This stage deconstructs an array field in the input documents to return a document for each element. Each output document replaces the array with an element value. For each input document, it returns n documents, where n is the number of array elements and can be zero for an empty array. In the next screen, we will look at an aggregation example. The aggregation operation given on the screen returns all states with a total population greater than 10 million. In this example, the aggregation pipeline contains the $group stage followed by the $match stage. In this operation, the $group stage does three things: 1. Groups the documents of the zipcode collection by the state field, 2. Calculates the totalPop (read as total population) field for each state, and 3. Returns an output document for each unique state. The new per-state documents contain two fields: the _id field and the totalPop field. The $match stage then passes only those per-state documents whose totalPop value is greater than 10 million. The aggregate pipeline is used in the next command as well. The $sort stage orders the documents, and the $group stage applies the sum operation on the amount fields of those documents. The second aggregation operation shown on the screen returns user names sorted by the month of their joining. This kind of aggregation could help generate membership renewal notices. In the next screen, we will view a demo on using the aggregate pipeline framework.
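The stages and the population example above can be sketched as follows. The sample documents, the pop field name, and the helpers skip, limit, and unwind are assumptions made for illustration. The shell-style pipeline is shown only for reference, and the actual computation is an in-memory imitation.

```javascript
// $skip and $limit: pass documents through unmodified, by position.
const skip = (input, n) => input.slice(n);
const limit = (input, n) => input.slice(0, n);

// $unwind: one output document per array element; none for an empty array.
const unwind = (input, field) =>
  input.flatMap((doc) => doc[field].map((item) => ({ ...doc, [field]: item })));

// Hypothetical documents with an array field.
const posts = [
  { title: "a", tags: ["x", "y"] },
  { title: "b", tags: [] },
  { title: "c", tags: ["z"] },
];
const unwound = unwind(posts, "tags");
const page = limit(skip(unwound, 1), 1);

// Hypothetical zip code documents; the "pop" field name is an assumption.
const zipcodes = [
  { state: "CA", pop: 6000000 },
  { state: "CA", pop: 5500000 },
  { state: "NY", pop: 9000000 },
  { state: "TX", pop: 12000000 },
];

// Shell form of the $group-then-$match pipeline, for reference only.
const pipeline = [
  { $group: { _id: "$state", totalPop: { $sum: "$pop" } } },
  { $match: { totalPop: { $gt: 10 * 1000 * 1000 } } },
];

// In-memory imitation: total the population per state, then filter.
const totals = new Map();
for (const doc of zipcodes) {
  totals.set(doc.state, (totals.get(doc.state) || 0) + doc.pop);
}
const largeStates = [...totals]
  .map(([state, totalPop]) => ({ _id: state, totalPop }))
  .filter((doc) => doc.totalPop > 10 * 1000 * 1000);

console.log(pipeline.length, unwound, page, largeStates);
```

Note that the $match stage here filters on totalPop, a field that exists only after $group has run, which is why it follows $group in this pipeline.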
4.56 Demo—Use Aggregate Function
This demo will show you the steps to use the aggregate pipeline framework in MongoDB. Click the demo icon to view the demo.
MapReduce is a data processing model used for aggregation. To perform MapReduce operations, MongoDB provides the mapReduce database command. A MapReduce operation consists of two phases. In the map stage, documents are processed and one or more objects are produced for each input document. In the reduce stage, the outputs of the map operation are combined. Optionally, there can be an additional finalize stage to make final modifications to the result. Similar to other aggregation operations, MapReduce can define a query condition to select the input documents, and sort and limit the results. We will continue our discussion on MapReduce in the next screen.
4.59 MapReduce (contd.)
4.60 MapReduce (contd.)
You can use MapReduce to perform many complex aggregation operations, including operations on sharded collections. The diagram shown on the screen depicts the orders collection, whose documents have three fields: cust_id, amount, and status. If you want to find out the sum of the total amount for each customer, then use the MapReduce framework. In the map stage, cust_id is emitted as the key and amount as the value. These key-value pairs are further processed by the reduce stage, in which each cust_id and the array of its amounts are passed as input to the reducer. The reducer then finds the total of the amounts and generates cust_id as the key and order_totals as the value. In the next screen, we will view a demo on using the MapReduce function.
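The map and reduce steps described above can be sketched in plain JavaScript. This is a simplified in-memory model, not the mapReduce command itself: the sample documents and the names mapOrder, reduceAmounts, and runMapReduce are hypothetical, and emit is passed in as a parameter here, whereas in the mongo shell the map function calls a built-in emit and reads the document through this.

```javascript
// Hypothetical "orders" documents.
const orders = [
  { cust_id: "A123", amount: 500, status: "A" },
  { cust_id: "A123", amount: 250, status: "A" },
  { cust_id: "B212", amount: 200, status: "A" },
];

// Map step: emit one (key, value) pair per document.
function mapOrder(doc, emit) {
  emit(doc.cust_id, doc.amount);
}

// Reduce step: combine all values emitted for one key.
function reduceAmounts(key, values) {
  return values.reduce((sum, value) => sum + value, 0);
}

// Minimal driver: collect emitted values per key, then reduce each group.
// Simplification: this calls the reducer for every key, even one with a
// single emitted value.
function runMapReduce(docs, mapFn, reduceFn) {
  const buckets = new Map();
  const emit = (key, value) => {
    if (!buckets.has(key)) buckets.set(key, []);
    buckets.get(key).push(value);
  };
  docs.forEach((doc) => mapFn(doc, emit));
  return [...buckets].map(([key, values]) => ({
    _id: key,
    value: reduceFn(key, values),
  }));
}

const orderTotals = runMapReduce(orders, mapOrder, reduceAmounts);
console.log(orderTotals);
```

Each result pairs a customer key with the reduced value, mirroring the _id and value fields of MapReduce output documents.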
4.61 Demo—Use MapReduce in MongoDB
This demo will show you the steps to use the MapReduce function in MongoDB. Click the demo icon to view the demo.
4.63 Aggregation Operations
Aggregations are operations that manipulate data and return a computed result based on the input documents and a specific procedure. MongoDB also provides a number of single-purpose aggregation operations on data sets. These operations have limited scope compared to the aggregation pipeline and MapReduce functions. They provide the following semantics for common data processing options. Count: MongoDB returns the number of documents matching a query. The count command along with the two methods, count() and cursor.count(), provides access to total counts in the mongo shell. The db.customer_info.count() (read as DB dot customer underscore info dot count method) command shown on the screen counts all documents in the customer_info (read as customer underscore info) collection. Distinct: The distinct operation searches for documents matching a query and returns all unique values for a field in the matched documents. The distinct command and the db.collection.distinct() method execute this operation in the mongo shell. The syntax given on the screen is an example of a distinct operation. In the next screen, we will view a demo on how to use the distinct and count methods.
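What count and distinct compute can be sketched in memory as follows. The sample documents, the city field, and the helper functions count and distinct are hypothetical stand-ins; in the mongo shell the corresponding calls would be db.customer_info.count() and db.customer_info.distinct() with a field name.

```javascript
// Hypothetical documents in a customer_info collection.
const customerInfo = [
  { name: "ann", city: "Pune" },
  { name: "bob", city: "Delhi" },
  { name: "cid", city: "Pune" },
];

// Count: the number of documents that match a query (all by default).
const count = (docs, predicate = () => true) => docs.filter(predicate).length;

// Distinct: the unique values of one field across the matching documents.
const distinct = (docs, field, predicate = () => true) => [
  ...new Set(docs.filter(predicate).map((doc) => doc[field])),
];

const totalCustomers = count(customerInfo);
const puneCustomers = count(customerInfo, (doc) => doc.city === "Pune");
const cities = distinct(customerInfo, "city");
console.log(totalCustomers, puneCustomers, cities);
```

Count returns a single number, whereas distinct returns an array of values rather than documents.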
4.64 Demo—Use Distinct and Count Methods
This demo will show you the steps to use the distinct and count methods in MongoDB. Click the demo icon to view the demo.
4.66 Aggregation Operations (contd.)
The group operation accepts as input a set of documents that match the given query, applies the operation, and then returns an array of documents with the computed results. Note that group does not support sharded collection data. In addition, the results of the group operation must not exceed 16 megabytes. The group operation shown on the screen groups documents by the field ‘a’, where ‘a’ is less than three, and sums the field count for each group. In the next screen, we will view a demo on how to use the group function in MongoDB.
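The group operation described above can be imitated in memory as shown below. The sample documents are hypothetical, and the group helper is a simplified stand-in written for this sketch. Its options object borrows the key, cond, reduce, and initial names from the shell's group method, but cond is a plain predicate function here instead of a query document such as { a: { $lt: 3 } }. The group command itself was removed in later MongoDB releases, where the $group pipeline stage covers the same need.

```javascript
// Hypothetical input documents.
const records = [
  { a: 1, count: 4 },
  { a: 1, count: 2 },
  { a: 2, count: 3 },
  { a: 4, count: 3 },
];

// Options in the spirit of the shell's group method.
const groupSpec = {
  key: { a: 1 },
  cond: (doc) => doc.a < 3, // stands in for { a: { $lt: 3 } }
  reduce: (cur, result) => {
    result.count += cur.count;
  },
  initial: { count: 0 },
};

// Simplified in-memory group: filter, bucket by key fields, then fold
// every document into its bucket's running result.
function group(docs, spec) {
  const keyFields = Object.keys(spec.key);
  const results = new Map();
  for (const doc of docs.filter(spec.cond)) {
    const keyValues = Object.fromEntries(keyFields.map((f) => [f, doc[f]]));
    const id = JSON.stringify(keyValues);
    if (!results.has(id)) results.set(id, { ...keyValues, ...spec.initial });
    spec.reduce(doc, results.get(id));
  }
  return [...results.values()];
}

const groupedCounts = group(records, groupSpec);
console.log(groupedCounts);
```

The document with a equal to 4 fails the condition, so only the groups for a equal to 1 and 2 appear in the returned array.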
4.67 Demo—Use the Group Function
This demo will show you the steps to use the group function in MongoDB. Click the demo icon to view the demo.
With this, we come to the end of this lesson. Following are a few questions to test your understanding of the concepts discussed here.
Here is a quick recap of what was covered in this lesson: • Indexes are data structures that store a data set in an easily traversable form. • Indexes help execute queries efficiently without performing a collection scan. • MongoDB supports the following indexes: single field, compound, multikey, geospatial, text, and hashed. • For fast query operation, the system RAM must be able to accommodate index sizes. • You can create, modify, rebuild, and drop indexes. • The geospatial indexes help query geographic locations by specifying a specific point. • The aggregation operations manipulate data and return a computed result based on the input and a specific procedure. • Aggregation functions in MongoDB help with query operations such as calculating the total sum spent by a customer on an online shopping site.
This concludes the lesson Indexing and Aggregation in MongoDB. In the next lesson, we will discuss Replication and Sharding in MongoDB.