MongoDB is one of the most popular document databases. It’s the M in the MEAN stack (MongoDB, Express, Angular, and Node.js). Unlike relational databases such as MySQL or PostgreSQL, MongoDB uses JSON-like documents for storing data. MongoDB Community Server is free to use, its source code is publicly available, and it performs well for many workloads. However, just as with any other database, certain issues can cost MongoDB its edge and drag it down.

In this article, we’ll look at a few key metrics and what they mean for MongoDB performance. Specifically, we’ll look at the following areas:

Performance of locking in transactions
Memory usage
Connection handling
Issues with replica sets
Of course, MongoDB performance is a huge topic encompassing many areas of system activity.

Now, let’s get into it.

Analyze locking performance
How does MongoDB handle locking? Databases operate in an environment that consists of numerous reads, writes, and updates. Any one of a hundred clients can trigger any of these activities. They’re often not sequential, and they frequently use data that another request is in the middle of updating.

For example, if a client attempts to read a document that another client is updating, conflicts can occur. And this can cause lost or unexpectedly altered data. To get around this issue and maintain consistency, databases will lock certain documents or collections.

When an operation holds an exclusive lock, no other operation can read or modify the locked data until the lock is released. (Shared locks, by contrast, let several readers proceed at once.) This prevents conflicts. But it can also severely degrade the database’s performance.

Consider another example. If a read operation is waiting for a write operation to complete, and the write operation is taking a long time, additional operations will also have to wait. In the case of a large write or read, that alone can be enough to noticeably degrade database performance. If the server is unresponsive for too long, it can cause a replica state change, which can lead to further cascading problems.

How to view lock metrics
Luckily, MongoDB provides some useful metrics to help determine if locking is affecting database performance. Let’s look at the globalLock and locks sections of the db.serverStatus() command output:

> db.serverStatus().globalLock
{
    "totalTime" : <num>,
    "currentQueue" : {
        "total" : <num>,
        "readers" : <num>,
        "writers" : <num>
    },
    "activeClients" : {
        "total" : <num>,
        "readers" : <num>,
        "writers" : <num>
    }
}

> db.serverStatus().locks
{
    <type> : {
        "acquireCount" : {
            <mode> : NumberLong(<num>),
            ...
        },
        "acquireWaitCount" : {
            <mode> : NumberLong(<num>),
            ...
        },
        "timeAcquiringMicros" : {
            <mode> : NumberLong(<num>),
            ...
        },
        "deadlockCount" : {
            <mode> : NumberLong(<num>),
            ...
        }
    }
}
How should we interpret these numbers? Let’s start with the globalLock section:

globalLock.currentQueue.total: This number can indicate a possible concurrency issue if it’s consistently high. This can happen if a lot of requests are waiting for a lock to be released.
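globalLock.totalTime: The time, in microseconds, since the database last started and created the global lock. This is roughly equivalent to total server uptime, so treat it as context for the other lock metrics rather than as a warning sign on its own.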
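
To make "consistently high" a little more concrete, here is a minimal sketch that checks several globalLock snapshots against a threshold. The function name and the threshold are hypothetical; the right cutoff depends on the workload.

```javascript
// Hypothetical helper (not part of MongoDB): returns true only when every
// sampled globalLock snapshot has at least `threshold` queued operations.
// `threshold` is an assumption; choose a value that fits your workload.
function queueConsistentlyHigh(samples, threshold) {
  if (samples.length === 0) {
    return false; // no data, so nothing can be called "consistent"
  }
  return samples.every(function (snapshot) {
    return snapshot.currentQueue.total >= threshold;
  });
}

// In the shell, snapshots could be collected a few seconds apart:
//   var samples = [];
//   samples.push(db.serverStatus().globalLock);
//   // ...wait, then push again...
//   queueConsistentlyHigh(samples, 10);
```

A single high reading returns false here on purpose, since a short burst of queued operations is usually less of a concern than a queue that never drains.
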
And here’s what the metrics mean in the locks section:

locks.<type>.acquireCount: Number of times the lock was acquired
locks.<type>.acquireWaitCount: Number of times lock acquisitions (those counted in acquireCount) had to wait because of conflicting locks
locks.<type>.timeAcquiringMicros: Cumulative wait time for lock acquisitions, in microseconds
locks.<type>.deadlockCount: Number of times the lock acquisitions have encountered deadlocks
Each of these possible lock types tracks the above metrics:

Global: Global locks
MMAPV1Journal: Locks to synchronize journal writes
Database: Database locks
Collection: Collection-related locks
Metadata: Metadata-related locks
Oplog: Operational log locks
We can also determine the average wait time for a particular lock type by dividing locks.<type>.timeAcquiringMicros by locks.<type>.acquireWaitCount for the same lock mode. MongoDB has a great FAQ that explains locking and concurrency in more detail. It also offers additional background information.
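
The average-wait calculation can be sketched as a small function over the locks section. The helper name and the "type.mode" key format are assumptions made for illustration.

```javascript
// Hypothetical helper: average wait per lock type and mode, in microseconds.
// Computes timeAcquiringMicros / acquireWaitCount for each mode that waited.
function averageLockWaitMicros(locks) {
  var result = {};
  Object.keys(locks).forEach(function (type) {
    var waits = locks[type].acquireWaitCount || {};
    var times = locks[type].timeAcquiringMicros || {};
    Object.keys(waits).forEach(function (mode) {
      // Number() is used in case the shell hands back 64-bit wrapper values.
      var count = Number(waits[mode]);
      var total = Number(times[mode] || 0);
      if (count > 0) {
        result[type + "." + mode] = total / count;
      }
    });
  });
  return result;
}

// Usage in the shell:
//   averageLockWaitMicros(db.serverStatus().locks);
```

Lock types that never waited are left out of the result, so an empty object means no acquisition has been blocked so far.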

Storage engines and locking
Is the database frequently locking from queries? This might indicate issues with the schema design, query structure, or system architecture.

For MongoDB versions before 3.2, the default storage engine is MMAPv1. For version 3.2 on, WiredTiger is the default.

Note the differences in how these storage engines handle locking. MMAPv1 locks whole collections, not individual documents. WiredTiger performs locking at the document level. This reduces lock contention and lets us read or update multiple documents in a collection concurrently.
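
To check which engine an instance runs, we can read the storageEngine.name field of the serverStatus document. The sketch below maps that name to the locking granularity described above; the helper name and the "unknown" fallback are assumptions.

```javascript
// Hypothetical helper: maps the reported storage engine to the lock
// granularity discussed in this article.
function lockGranularity(status) {
  var name = status.storageEngine && status.storageEngine.name;
  if (name === "wiredTiger") {
    return "document";
  }
  if (name === "mmapv1") {
    return "collection";
  }
  return "unknown"; // other engines are outside the scope of this sketch
}

// Usage in the shell:
//   lockGranularity(db.serverStatus());
```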

Examine memory use
When the MMAPv1 storage engine is in use, MongoDB will use memory-mapped files to store data. All available memory will be allocated for this usage if the data set is large enough.

We can use the metrics in the memory section of the serverStatus document to understand how MongoDB is using system memory:

> db.serverStatus().mem
{
    "bits" : <int>,
    "resident" : <int>,
    "virtual" : <int>,
    "supported" : <boolean>,
    "mapped" : <int>,
    "mappedWithJournal" : <int>
}
Two of these fields, in particular, are interesting for understanding current memory usage:

mem.resident: Roughly equivalent to the amount of RAM in megabytes that the database process uses
mem.mapped: The amount of memory that the database maps, in megabytes
To see if we’ve exceeded the capacity of our system, we can compare the value of mem.resident to the amount of system memory. If mem.resident exceeds the value of system memory and there’s a large amount of unmapped data on disk, we’ve most likely exceeded system capacity.

If the value of mem.mapped is greater than the amount of system memory, some operations will experience page faults.
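
The two comparisons above can be sketched as follows. The helper name is hypothetical, and the caller has to supply the amount of system memory in megabytes (in the shell, db.hostInfo().system.memSizeMB is one place to look).

```javascript
// Hypothetical helper: compares the mem section against system memory.
// Both mem.resident and mem.mapped are reported in megabytes.
function memoryPressure(mem, systemMemoryMB) {
  return {
    residentExceedsRam: mem.resident > systemMemoryMB,
    // mem.mapped is only reported by MMAPv1, so guard against its absence.
    mappedExceedsRam:
      typeof mem.mapped === "number" && mem.mapped > systemMemoryMB
  };
}

// Usage in the shell:
//   memoryPressure(db.serverStatus().mem, db.hostInfo().system.memSizeMB);
```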

Tune the WiredTiger cache
MongoDB deprecated the MMAPv1 storage engine in version 4.0 and removed it in version 4.2. So, I’d advise you to move any deployment that still runs MMAPv1 to the WiredTiger storage engine. The WiredTiger storage engine is a significant improvement over MMAPv1 in performance and concurrency. It also offers the benefits of compression and encryption.

By default, MongoDB sizes the WiredTiger data cache at the larger of 50 percent of (RAM minus 1 GB) or 256 MB (the rule since version 3.4). The size of this cache is important to ensure WiredTiger performs adequately. It’s worth taking a look to see if you should alter it from the default. A good rule of thumb is that the size of the cache should be big enough to hold the entire application working set.
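
As a worked example, MongoDB’s documentation for version 3.4 and later gives the default cache size as the larger of 50 percent of (RAM minus 1 GB) and 256 MB. A minimal sketch of that arithmetic, with a hypothetical function name:

```javascript
// Hypothetical helper: default WiredTiger cache size in GB for a host with
// `totalRamGB` of memory, following the documented 3.4+ rule:
//   max(0.5 * (RAM - 1 GB), 256 MB)
function defaultWiredTigerCacheGB(totalRamGB) {
  var halfOfRamMinusOne = 0.5 * (totalRamGB - 1);
  return Math.max(halfOfRamMinusOne, 0.25); // 0.25 GB = 256 MB
}

// Example: defaultWiredTigerCacheGB(16) evaluates to 7.5
```

On a 16 GB host this gives 7.5 GB, while very small hosts fall back to the 256 MB floor.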

How do we know whether to alter it? Look at the cache usage statistics:

> db.serverStatus().wiredTiger.cache
{
    "tracked dirty bytes in the cache" : <num>,
    "tracked bytes belonging to internal pages in the cache" : <num>,
    "bytes currently in the cache" : <num>,
    "tracked bytes belonging to leaf pages in the cache" : <num>,
    "maximum bytes configured" : <num>,
    "tracked bytes belonging to overflow pages in the cache" : <num>,
    "bytes read into cache" : <num>,
    "bytes written from cache" : <num>,
    "pages evicted by application threads" : <num>,
    "checkpoint blocked page eviction" : <num>,
    "unmodified pages evicted" : <num>,
    "page split during eviction deepened the tree" : <num>,
    "modified pages evicted" : <num>,
    "pages selected for eviction unable to be evicted" : <num>,
    "pages evicted because they exceeded the in-memory maximum" : <num>,
    "pages evicted because they had chains of deleted items" : <num>,
    "failed eviction of pages that exceeded the in-memory maximum" : <num>,
    "hazard pointer blocked page eviction" : <num>,
    "internal pages evicted" : <num>,
    "maximum page size at eviction" : <num>,
    "eviction server candidate queue empty when topping up" : <num>,
    "eviction server candidate queue not empty when topping up" : <num>,
    "eviction server evicting pages" : <num>,
    "eviction server populating queue, but not evicting pages" : <num>,
    "eviction server unable to reach eviction goal" : <num>,
    "internal pages split during eviction" : <num>,
    "leaf pages split during eviction" : <num>,
    "pages walked for eviction" : <num>,
    "eviction worker thread evicting pages" : <num>,
    "in-memory page splits" : <num>,
    "in-memory page passed criteria to be split" : <num>,
    "lookaside table insert calls" : <num>,
    "lookaside table remove calls" : <num>,
    "percentage overhead" : <num>,
    "tracked dirty pages in the cache" : <num>,
    "pages currently held in the cache" : <num>,
    "pages read into cache" : <num>,
    "pages read into cache requiring lookaside entries" : <num>,
    "pages written from cache" : <num>,
    "page written requiring lookaside records" : <num>,
    "pages written requiring in-memory restoration" : <num>
}
There’s a lot of data here, but we can focus on the following fields:

wiredTiger.cache.maximum bytes configured: This is the maximum cache size.
wiredTiger.cache.bytes currently in the cache: This is the size of the data currently in the cache. This should not be greater than the maximum bytes configured.
wiredTiger.cache.tracked dirty bytes in the cache: This is the size of the dirty data in the cache. This value should be less than the bytes currently in the cache value.
Looking at these values, we can determine if we need to increase the size of the cache for our instance. Additionally, we can look at the wiredTiger.cache.bytes read into cache value for read-heavy applications. This counter is cumulative, so watch how fast it grows between samples. If that rate is consistently high, increasing the cache size may improve overall read performance.
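
One way to read the three fields together is to express them as fractions of the configured maximum. This is only a sketch; the helper and its property names are hypothetical.

```javascript
// Hypothetical helper: cache fill and dirty data as fractions of the
// configured maximum. The keys are the literal serverStatus field names.
function cacheUsage(cache) {
  var max = cache["maximum bytes configured"];
  var used = cache["bytes currently in the cache"];
  var dirty = cache["tracked dirty bytes in the cache"];
  return {
    usedRatio: used / max,
    dirtyRatio: dirty / max
  };
}

// Usage in the shell:
//   cacheUsage(db.serverStatus().wiredTiger.cache);
```

A usedRatio that stays close to 1 suggests the working set may not fit in the cache, which is the case where a larger cache is worth testing.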

Monitor the number of current connections
Unless system limits constrain it, MongoDB has no limits on incoming connections. But there’s a catch. In some cases, a large number of connections between the application and database can overwhelm the database. This will limit its ability to handle additional connections.

You can find the connections metrics in the connections section of the serverStatus document:

> db.serverStatus().connections
{
    "current" : <num>,
    "available" : <num>,
    "totalCreated" : NumberLong(<num>)
}
There are two fields here we want to look at:

connections.current: The total number of current clients connected to this instance
connections.available: The total number of unused connections available to clients for this instance
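
As a rough gauge, the two fields can be combined into a utilization figure. The helper name is hypothetical, and treating current plus available as the total capacity is an assumption of this sketch.

```javascript
// Hypothetical helper: fraction of connection capacity currently in use,
// assuming capacity = current + available.
function connectionUtilization(connections) {
  var capacity = connections.current + connections.available;
  if (capacity === 0) {
    return 0; // avoid dividing by zero on an empty or malformed section
  }
  return connections.current / capacity;
}

// Usage in the shell:
//   connectionUtilization(db.serverStatus().connections);
```
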
If connection issues are a problem, there are a couple of strategies for resolving them. First, check whether the application is read-heavy. If it is, increase the size of the replica set and distribute the read operations to secondary members of the set. If this application is write-heavy, use sharding within a sharded cluster to distribute the load.

Application- or driver-related errors can also cause connection issues. For instance, a connection may be disposed of improperly or may open when not needed, if there’s a bug in the driver or application. You’ll see this if the number of connections is high, but there’s no corresponding workload.

Watch replication performance
Replication is the propagation of data from one node to another. It’s key to MongoDB being able to meet availability challenges. As data changes, it’s propagated from the primary node to secondary nodes. Replica sets handle this replication.

Monitor replication lag
In a perfect world, data would be replicated among nodes almost instantaneously. But we don’t live in a perfect world. Sometimes, data isn’t replicated as quickly as we’d like. And, depending on the time it takes for a replication to occur, we run the risk of data becoming out of sync.

This is a particularly thorny problem if the lag between a primary and secondary node is high and the secondary becomes the primary. Because the replication didn’t occur quickly enough, writes that never reached the new primary can be rolled back, and therefore lost, when the old primary rejoins the set as a secondary.

So how do we know what our replication lag is?

We can use the db.printSlaveReplicationInfo() or the rs.printSlaveReplicationInfo() command to see the status of a replica set from the perspective of the secondary members of the set. (Newer releases rename the helper to rs.printSecondaryReplicationInfo().) The following is an example of running this command on a replica set with two secondary members:

source: m1.example.net:27017
syncedTo: Thu Apr 10 2014 10:27:47 GMT-0400 (EDT)
0 secs (0 hrs) behind the primary
source: m2.example.net:27017
syncedTo: Thu Apr 10 2014 10:27:47 GMT-0400 (EDT)
0 secs (0 hrs) behind the primary
The output of this command shows how far behind the secondary members are from the primary. This number should be as low as possible. However, it’s really going to be based on the application’s tolerance for a delay in replication. You should monitor this metric closely. In addition, administrators should watch for any spikes in replication delay. If replication lag is consistently high or it increases at a regular rate, that’s a clear sign of environmental or systemic problems. Always investigate these issues to understand the reasons for the lag.
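
For monitoring scripts, the same lag figure can be derived from the members array that rs.status() returns. The sketch below assumes each member carries name, stateStr, and optimeDate fields; the helper name is hypothetical.

```javascript
// Hypothetical helper: seconds each secondary is behind the primary,
// based on the optimeDate values reported by rs.status().
function replicationLagSeconds(members) {
  var primary = null;
  members.forEach(function (m) {
    if (m.stateStr === "PRIMARY") {
      primary = m;
    }
  });
  if (primary === null) {
    return null; // no primary visible, so lag cannot be computed
  }
  var lag = {};
  members.forEach(function (m) {
    if (m.stateStr === "SECONDARY") {
      // Subtracting two Date objects yields milliseconds.
      lag[m.name] = (primary.optimeDate - m.optimeDate) / 1000;
    }
  });
  return lag;
}

// Usage in the shell:
//   replicationLagSeconds(rs.status().members);
```

Sampling this on a schedule makes it easier to spot the spikes and steady growth described above.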

Monitor replication state
As mentioned above, replica sets handle replication among nodes. One member of the replica set is the primary. All others are secondaries. Under normal conditions, the assigned node status should rarely change. If a role change does occur (that is, a secondary node is elected primary), we want to know immediately. The election of a new primary usually occurs seamlessly. Still, you should understand what caused the status change, because it could be due to a network or hardware failure. In addition, it’s not normal for nodes to change back and forth between primary and secondary.
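
One simple way to notice a role change is to compare two rs.status() snapshots taken at different times. This is a minimal sketch with a hypothetical helper name; it assumes each member has name and stateStr fields.

```javascript
// Hypothetical helper: lists members whose state differs between two
// snapshots of rs.status().members.
function findRoleChanges(before, after) {
  var previous = {};
  before.forEach(function (m) {
    previous[m.name] = m.stateStr;
  });
  return after
    .filter(function (m) {
      return previous[m.name] !== undefined && previous[m.name] !== m.stateStr;
    })
    .map(function (m) {
      return { name: m.name, from: previous[m.name], to: m.stateStr };
    });
}

// Usage in the shell:
//   var before = rs.status().members;
//   // ...later...
//   findRoleChanges(before, rs.status().members);
```

Members that appear only in the second snapshot are ignored here, since a newly added node is not a role change.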

Take it further with the database profiler
You can gather additional detailed performance information using the built-in database profiler. If additional performance tuning needs to happen, or none of the above seems to cut it, you can use the profiler to gain a deeper understanding of the database’s behavior. The profiler collects information about all database commands that are executed against an instance. This includes the common create, read, update, and delete operations. It also covers all configuration and administration commands. However, there’s a catch. Enabling the profiler can affect system performance, due to the additional activity.
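
Once the profiler is on, its entries land in the system.profile collection, where each document includes fields such as op, ns, and millis. The sketch below ranks a batch of those documents by duration; the helper name is hypothetical, and the 100 ms threshold in the usage comment is only an example value.

```javascript
// Hypothetical helper: the n slowest operations from a list of profiler
// documents, keeping only the operation type, namespace, and duration.
function slowestOperations(profileDocs, n) {
  return profileDocs
    .slice() // copy so the caller's array is not reordered
    .sort(function (a, b) {
      return b.millis - a.millis;
    })
    .slice(0, n)
    .map(function (doc) {
      return { op: doc.op, ns: doc.ns, millis: doc.millis };
    });
}

// Usage in the shell (level 1 records only operations slower than the threshold):
//   db.setProfilingLevel(1, 100);
//   slowestOperations(db.system.profile.find().toArray(), 5);
//   db.setProfilingLevel(0); // turn the profiler off again when done
```

Profiling only slow operations, and switching the profiler off afterwards, keeps the extra load mentioned above to a minimum.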