NoSQL Database Notes | DBMS SPPU 2024 Pattern

Introduction to NoSQL Databases

For decades, Relational Database Management Systems (RDBMS) like MySQL and Oracle dominated the software industry. These databases store data in strict, predefined tables using rows and columns. However, with the explosion of the internet, social media, and IoT (Internet of Things), the volume, velocity, and variety of data grew massively. Traditional relational databases struggled to scale efficiently to handle this unstructured "Big Data."

This gave rise to NoSQL databases. NoSQL stands for "Not Only SQL." It is a class of database management systems that do not use the traditional relational (table-based) model. Instead, NoSQL databases are designed to store and process large amounts of unstructured or semi-structured data rapidly and flexibly. They are highly scalable and are built to run on distributed networks.

In simple terms: NoSQL does not mean rejecting SQL entirely; it means "Not Only SQL". When we have a lot of unstructured data (such as social media posts or sensor data) that is hard to fit into tables, we use NoSQL databases because they are fast and flexible.

Important Features of NoSQL Databases

Flexible Schema: You do not need to define a strict structure (like columns and data types) before inserting data. Each record can have a different structure.
Horizontal Scalability: Traditional databases scale vertically (adding more RAM or CPU to a single server). NoSQL databases scale horizontally (adding more standard servers to distribute the load).
Distributed Architecture: They are designed to run across multiple machines and usually replicate data between them, so that if one machine fails, the database can remain operational.
Handling Unstructured Data: They easily handle JSON files, images, logs, and graph data.
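
Horizontal scaling can be illustrated in plain JavaScript with no database involved. The sketch below assumes a hypothetical three-server cluster and a very simple string hash; real systems use more robust schemes (such as consistent hashing), so treat it only as an illustration of how keys get spread across servers.

```javascript
// Hypothetical server names, used only for illustration.
const servers = ["server-0", "server-1", "server-2"];

// Simple deterministic string hash (an assumption, not any real database's algorithm).
function simpleHash(key) {
  let hash = 0;
  for (const ch of key) {
    hash = (hash * 31 + ch.codePointAt(0)) % 1000003;
  }
  return hash;
}

// Pick the server responsible for a key.
function serverFor(key, serverList) {
  return serverList[simpleHash(key) % serverList.length];
}

console.log(serverFor("user:101", servers));
console.log(serverFor("user:102", servers));
```

Because the hash is deterministic, the same key always maps to the same server, which is what lets any node locate a record without a central lookup.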

NoSQL Data Models

Unlike SQL databases, which only use tables, NoSQL offers four primary data models. The choice of model depends entirely on the type of application being built.

1. Key-Value Stores

This is the simplest type of NoSQL database. Every single item in the database is stored as an attribute name (key) together with its value. It works exactly like a dictionary or a hash map in programming.

Working: The database uses a unique key to fetch the corresponding value. The value can be a simple string, a number, or a complex object.
Applications: Session management in web applications, user preferences, and shopping carts.
Examples: Redis, Amazon DynamoDB, Riak.

In simple terms: A key-value store is just like a dictionary. You set a unique 'Key' and store a 'Value' under it. Lookups are very fast because the system goes directly to the key.
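
The dictionary analogy can be tried directly in JavaScript, where the built-in Map behaves like a tiny in-memory key-value store. This is only an analogy for how products like Redis are used, not their actual API; the key names and values are made up.

```javascript
// A Map behaves like a tiny in-memory key-value store.
const sessionStore = new Map();

// "SET": store a value under a unique key.
sessionStore.set("session:abc123", { userId: 42, cart: ["pen", "notebook"] });

// "GET": fetch the value directly by its key, with no scanning.
const session = sessionStore.get("session:abc123");
console.log(session.userId);

// A missing key simply returns undefined.
console.log(sessionStore.get("session:unknown"));
```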

2. Document Databases

Document databases store data in documents similar to JSON (JavaScript Object Notation) objects. Each document contains pairs of fields and values. The values can typically be a variety of types including things like strings, numbers, booleans, arrays, or even nested objects.

Working: Instead of spreading a user's data across multiple tables (like Profile, Address, Orders), a document database stores all information about that user in a single, structured document.
Applications: Content management systems, e-commerce catalogs, and user profiles.
Examples: MongoDB, CouchDB.
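
As a rough illustration, a single document for one user might look like the JavaScript object below. The field names and values are hypothetical; the point is that profile, address, and orders live together instead of in three tables.

```javascript
// One document holding profile, address, and orders together (all values are made up).
const userDocument = {
  _id: 1,
  name: "Rahul",
  address: { city: "Pune", pincode: "411001" },
  orders: [
    { item: "Keyboard", price: 1500 },
    { item: "Mouse", price: 500 },
  ],
};

// Everything about the user is reachable from the single document, with no JOIN.
const orderTotal = userDocument.orders.reduce((sum, order) => sum + order.price, 0);
console.log(userDocument.address.city, orderTotal);
```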

3. Column-Family Stores

Also known as wide-column stores, these databases organize each row's columns into groups called column families. A column family is a container for rows, but unlike relational databases, every row does not need to have the same columns.

Working: They group related columns together and store them physically together on the disk. This makes reading specific columns across many rows (for example, only the salary column of millions of employee records) fast, because unrelated columns do not have to be read.
Applications: Analytics, recommendation engines, and time-series data (like weather monitoring).
Examples: Apache Cassandra, HBase.
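
The row and column-family layout can be sketched with nested JavaScript structures. This is a conceptual model only, not how Cassandra or HBase store data internally; the table, family, and column names are hypothetical. Note that the two rows do not share the same columns.

```javascript
// rowKey -> column family -> columns. Names and values are hypothetical.
const sensorTable = new Map([
  ["sensor-1", { readings: { temp: 31.5, humidity: 60 }, meta: { city: "Pune" } }],
  ["sensor-2", { readings: { temp: 28.0 }, meta: { city: "Mumbai", model: "X2" } }],
]);

// Read one column from one family for every row that has it.
function readColumn(table, family, column) {
  const values = [];
  for (const [rowKey, families] of table) {
    const value = families[family]?.[column];
    if (value !== undefined) values.push([rowKey, value]);
  }
  return values;
}

console.log(readColumn(sensorTable, "readings", "temp"));
console.log(readColumn(sensorTable, "readings", "humidity"));
```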

4. Graph Databases

Graph databases are highly specialized NoSQL databases used to store and navigate relationships. Data is stored as Nodes (entities like people or cities) and Edges (relationships between them, like "is friends with" or "is connected to").

Working: They are optimized to traverse relationships quickly without using complex and slow JOIN queries.
Applications: Social networks (Facebook, LinkedIn), fraud detection, and routing applications.
Examples: Neo4j, Amazon Neptune.

In simple terms: A graph database is used when the connections (relationships) between data items matter more than the data itself. For example, finding "Mutual Friends" on Facebook is a task that a graph database handles very well.
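
The nodes-and-edges idea can be sketched with an adjacency list in plain JavaScript. This is not how Neo4j is queried (it has its own query language); the names and friendships below are made up to show a "mutual friends" lookup as a simple traversal.

```javascript
// Adjacency list: each node (person) maps to the set of nodes it shares an edge with.
const friends = new Map([
  ["Rahul", new Set(["Amit", "Priya", "Neha"])],
  ["Amit", new Set(["Rahul", "Priya"])],
  ["Priya", new Set(["Rahul", "Amit", "Neha"])],
  ["Neha", new Set(["Rahul", "Priya"])],
]);

// Mutual friends = nodes adjacent to both people.
function mutualFriends(graph, a, b) {
  const friendsOfB = graph.get(b) ?? new Set();
  return [...(graph.get(a) ?? [])].filter((name) => friendsOfB.has(name));
}

console.log(mutualFriends(friends, "Amit", "Neha"));
```

The lookup only touches the edges of the two people involved, which is the intuition behind graph databases avoiding JOINs over whole tables.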

CAP Theorem

The CAP theorem is the foundational principle for designing distributed systems (systems where data is stored across multiple servers/nodes). Formulated by computer scientist Eric Brewer, the theorem states that it is impossible for a distributed data store to simultaneously provide more than two out of the following three guarantees:

1. Consistency (C)

Every read request receives the most recent write (the latest updated data) or an error. If a user updates their password on Server A, any subsequent login attempt on Server B must reflect the new password instantly. All nodes see the exact same data at the same time.

2. Availability (A)

Every request receives a non-error response, regardless of the individual state of a node. The system is always on and always responds, even if some servers are down. However, it does not guarantee that the response contains the most recent data.

3. Partition Tolerance (P)

The system continues to operate despite an arbitrary number of messages being dropped or delayed by the network between nodes. A "partition" is a communication break within a distributed system.

The Rule of CAP Theorem

In the real world, network failures (Partitions) are inevitable. Therefore, a distributed system MUST support Partition Tolerance (P). Because P is mandatory, system designers must choose between Consistency (C) and Availability (A) when a network failure occurs.

CP (Consistency + Partition Tolerance): If the network breaks, the system will block transactions (become unavailable) until the network is fixed, ensuring data is never out of sync. (Example: Banking systems; MongoDB is usually classified as CP in its default configuration.)
AP (Availability + Partition Tolerance): If the network breaks, the system will keep accepting requests and serving data, even if the data is slightly outdated. (Example: Social media feeds; Cassandra is usually classified as AP, although its consistency level is tunable.)

In simple terms: The CAP theorem says that when multiple servers work together, network failures are bound to happen, so the system must tolerate them (Partition Tolerance). When the network breaks, you must decide: either stop serving requests so that wrong data is never shown (Consistency), or keep the system running even if slightly old data is shown (Availability). You cannot offer both at the same time.
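
The CP versus AP choice can be made concrete with a toy simulation. The sketch below assumes a hypothetical two-replica cluster where writes arrive at replica A and reads are served by replica B; it is a simplified model, not the behaviour of any specific database.

```javascript
// Two replicas of one value; "partitioned" means they cannot talk to each other.
function makeCluster(mode) {
  return { mode, partitioned: false, replicaA: "v1", replicaB: "v1" };
}

// A write arrives at replica A.
function write(cluster, value) {
  if (cluster.partitioned && cluster.mode === "CP") {
    // CP: refuse the write rather than let the replicas diverge.
    return { ok: false, error: "unavailable during partition" };
  }
  cluster.replicaA = value;
  // Replication to B only succeeds when the network is healthy.
  if (!cluster.partitioned) cluster.replicaB = value;
  return { ok: true };
}

// A read is served by replica B.
function readFromB(cluster) {
  return cluster.replicaB;
}

const cp = makeCluster("CP");
const ap = makeCluster("AP");
cp.partitioned = true;
ap.partitioned = true;

const cpResult = write(cp, "v2"); // rejected: replicas stay in sync at "v1"
const apResult = write(ap, "v2"); // accepted: A holds "v2", B still serves "v1"
console.log(cpResult.ok, readFromB(cp));
console.log(apResult.ok, readFromB(ap));
```

In the CP cluster the user gets an error but never sees conflicting data; in the AP cluster the write succeeds but a reader on replica B sees the old value until the partition heals.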

BASE Properties

Relational databases follow ACID properties (Atomicity, Consistency, Isolation, Durability) to ensure strict data integrity. Many NoSQL databases, which prioritize scalability and availability (often AP systems in the CAP theorem), instead follow the BASE properties.

BASE stands for:

1. Basically Available (BA)

The database appears to work most of the time. Even if parts of the database fail or network partitions occur, the system will still generate a response (even if it is stale data). Availability is the highest priority.

2. Soft State (S)

Because the system does not guarantee immediate consistency, the state of the system can change over time, even without any new input. Data might be syncing in the background across different servers.

3. Eventual Consistency (E)

The system will eventually become consistent once it stops receiving input. The data will propagate to all nodes, and after a short period, all users will see the same updated data.

In simple terms: The main focus of the BASE properties is keeping the system always on (available). 'Eventual Consistency' means that if you upload a photo, it may not appear to all your friends immediately, but after a short while (eventually) everyone will see it. The system is flexible rather than strict.
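
Soft state and eventual consistency can be sketched with a small simulation. It assumes three hypothetical replicas and a queue of pending updates that a background sync delivers later; real systems use more elaborate replication protocols.

```javascript
// Each replica holds its own copy; updates reach other replicas via a pending queue.
const replicas = [{ photo: null }, { photo: null }, { photo: null }];
const pendingUpdates = [];

// Write to one replica immediately and queue the change for the others.
function upload(replicaIndex, value) {
  replicas[replicaIndex].photo = value;
  replicas.forEach((_, i) => {
    if (i !== replicaIndex) pendingUpdates.push({ target: i, value });
  });
}

// Background sync: deliver every queued update.
function syncAll() {
  while (pendingUpdates.length > 0) {
    const { target, value } = pendingUpdates.shift();
    replicas[target].photo = value;
  }
}

upload(0, "trip.jpg");
const beforeSync = replicas.map((r) => r.photo); // only replica 0 has the photo
syncAll();
const afterSync = replicas.map((r) => r.photo); // every replica has the photo
console.log(beforeSync, afterSync);
```

Between upload() and syncAll() the replicas disagree (soft state), yet every replica still answers reads (basically available); once the queue drains, all copies match (eventual consistency).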

Comparative Study of SQL and NoSQL

Parameter	SQL (Relational Database)	NoSQL (Non-Relational Database)
Data Model	Table-based (Rows and Columns).	Document, Key-Value, Graph, or Wide-Column.
Schema	Rigid and predefined. You must define columns before inserting data.	Flexible and dynamic. Documents can have different structures.
Scaling	Vertically Scalable (Requires upgrading CPU/RAM of a single server).	Horizontally Scalable (Easily adds more servers to the cluster).
Transactions	Follows strict ACID properties. Excellent for complex transactions.	Follows BASE properties. Focuses on availability over immediate consistency.
Relationships	Handles complex relationships natively using JOIN operations.	Not designed for complex JOINs; data is typically nested together.
Query Language	Uses SQL (Structured Query Language).	Uses object-oriented APIs or JSON-based query languages.
Best Used For	Financial systems, ERPs, systems requiring high data integrity.	Big data, real-time analytics, content management, rapid development.
Examples	MySQL, PostgreSQL, Oracle, SQL Server.	MongoDB, Cassandra, Redis, Neo4j.

Introduction to MongoDB

MongoDB is one of the most popular NoSQL databases in the world (its Community Edition is source-available under the SSPL license). It falls under the category of Document Databases.

Instead of storing data in tables and rows, MongoDB stores data in collections and documents.

Collection: Equivalent to a Table in an SQL database.
Document: Equivalent to a Row in an SQL database.

MongoDB stores data in a format called BSON (Binary JSON). This makes it incredibly easy for developers to map database objects directly to programming languages like JavaScript, Python, or Java.

Key Advantages of MongoDB

No rigid schema, so changing the structure of stored data does not require altering table definitions first.
Highly scalable through a process called Sharding.
High performance for read and write operations.

In simple terms: MongoDB is a document database. Instead of tables we create 'Collections', and instead of rows we save 'Documents' that look just like JSON (dictionary) data. This makes writing code and managing data much easier.

MongoDB: CRUD Operations

CRUD stands for Create, Read, Update, and Delete. These are the four basic operations performed on any database. Below are the standard commands used in the MongoDB shell.

1. Create (Inserting Data)

To add new documents into a collection, MongoDB provides insert methods. If the collection does not exist, MongoDB will automatically create it when you insert the first document.

Command 1: insertOne()

Used to insert a single document into a collection.

Syntax Example:

db.students.insertOne({ name: "Rahul", age: 21, branch: "Computer" })

Explanation: This command accesses the students collection and inserts one JSON-like document containing name, age, and branch. MongoDB automatically assigns a unique _id field to this document.

Command 2: insertMany()

Used to insert multiple documents at once by passing an array of documents.

Syntax Example:

db.students.insertMany([
  { name: "Amit", age: 22, branch: "IT" },
  { name: "Priya", age: 20, branch: "Mechanical" }
])

2. Read (Querying Data)

To retrieve data from a collection, MongoDB uses the find method.

Command 1: find()

Without any conditions, this fetches all documents in the collection (similar to SELECT * in SQL).

Syntax Example:

db.students.find()

To filter data, we pass a query object into the find method.

Syntax Example (Find students who are 21 years old):

db.students.find({ age: 21 })

Command 2: findOne()

Returns only the first document that matches the query.

Syntax Example:

db.students.findOne({ branch: "Computer" })

In simple terms: A read operation means getting data out of the database. Calling find() returns all the data. If we put a condition inside the parentheses, such as { age: 21 }, we get only the documents where age is 21.
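
To see what an equality filter does conceptually, the plain JavaScript sketch below mimics find() and findOne() over an in-memory array. The helper names (matchesQuery, findAll, findFirst) are hypothetical and handle only simple equality; MongoDB's real query engine supports many more operators.

```javascript
const studentDocs = [
  { name: "Rahul", age: 21, branch: "Computer" },
  { name: "Amit", age: 22, branch: "IT" },
  { name: "Priya", age: 20, branch: "Mechanical" },
];

// True when every field in the query equals the same field in the document.
function matchesQuery(doc, query) {
  return Object.entries(query).every(([field, value]) => doc[field] === value);
}

// Like find(): all matching documents (an empty query matches everything).
const findAll = (docs, query = {}) => docs.filter((doc) => matchesQuery(doc, query));

// Like findOne(): the first match, or null when nothing matches.
const findFirst = (docs, query = {}) => docs.find((doc) => matchesQuery(doc, query)) ?? null;

console.log(findAll(studentDocs, { age: 21 }));
console.log(findFirst(studentDocs, { branch: "Computer" }));
```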

3. Update (Modifying Data)

To modify existing documents, MongoDB uses update methods. It requires two parameters: a filter (to find the document) and an update operation (using operators like $set).

Command 1: updateOne()

Updates the first document that matches the filter.

Syntax Example (Change Rahul's age to 22):

db.students.updateOne(
  { name: "Rahul" },
  { $set: { age: 22 } }
)

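Explanation: The first part { name: "Rahul" } finds the student. The second part { $set: { age: 22 } } applies the new value. updateOne() requires an update operator such as $set; passing a plain document without any operator is rejected with an error. To replace a whole document, use replaceOne() instead.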
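
The effect of $set on a single document can be sketched in plain JavaScript. The applySet helper below is hypothetical and handles only top-level fields (not dotted paths such as "address.city"); it is meant to show that $set changes the named fields and leaves the rest of the document alone.

```javascript
// Hypothetical helper: apply the fields under $set onto a copy of the document.
function applySet(document, update) {
  return { ...document, ...update.$set };
}

const rahul = { _id: 1, name: "Rahul", age: 21, branch: "Computer" };
const updatedRahul = applySet(rahul, { $set: { age: 22 } });

// Only age changes; name, branch, and _id are preserved.
console.log(updatedRahul);
```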

Command 2: updateMany()

Updates all documents that match the filter condition.

Syntax Example (Change branch to 'IT' for everyone whose age is 20):

db.students.updateMany(
  { age: 20 },
  { $set: { branch: "IT" } }
)

4. Delete (Removing Data)

Used to permanently remove documents from a collection.

Command 1: deleteOne()

Deletes the first document that matches the given condition.

Syntax Example:

db.students.deleteOne({ name: "Amit" })

Command 2: deleteMany()

Deletes all documents that match the condition.

Syntax Example (Delete all students who are from the Mechanical branch):

db.students.deleteMany({ branch: "Mechanical" })

Indexing in MongoDB

In databases, searching through every single document to find a match (called a collection scan) is extremely slow when dealing with millions of records.

What is an Index?

An index is a special data structure that stores a small portion of the collection's data set in an easy-to-traverse form. It works much like the index at the back of a textbook: instead of reading the whole book to find a topic, you look at the index, find the page number, and jump directly to it.

Creating an Index

In MongoDB, you can create an index on any field to make queries on that field extremely fast.

Syntax Example:

db.students.createIndex({ age: 1 })

Explanation: This command creates an index on the age field. The 1 indicates ascending order, while -1 would indicate descending order. After running this, any find() query searching for a specific age will execute much faster.

In simple terms: Indexing is used to speed up the database. Just as the index at the back of a book helps you find a specific topic quickly, a database index helps find specific data quickly without checking the whole collection.
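
The difference between a collection scan and an indexed lookup can be sketched in plain JavaScript. The example uses a Map from age to document positions purely as an analogy; MongoDB's real indexes are ordered B-tree structures that also support range queries and sorting. All names and data are hypothetical.

```javascript
const records = [
  { name: "Rahul", age: 21 },
  { name: "Amit", age: 22 },
  { name: "Priya", age: 20 },
  { name: "Sneha", age: 21 },
];

// Collection scan: look at every document.
function scanByAge(docs, age) {
  return docs.filter((doc) => doc.age === age);
}

// Build an index once: age -> positions of matching documents.
function buildAgeIndex(docs) {
  const index = new Map();
  docs.forEach((doc, position) => {
    if (!index.has(doc.age)) index.set(doc.age, []);
    index.get(doc.age).push(position);
  });
  return index;
}

// Indexed lookup: jump straight to the matching positions.
function findByAge(docs, index, age) {
  return (index.get(age) ?? []).map((position) => docs[position]);
}

const ageIndex = buildAgeIndex(records);
console.log(scanByAge(records, 21).map((doc) => doc.name));
console.log(findByAge(records, ageIndex, 21).map((doc) => doc.name));
```

Both calls return the same documents; the indexed version avoids touching records that cannot match, at the cost of extra memory and the work of keeping the index up to date on every write.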

Aggregation in MongoDB

While standard queries (find()) are good for retrieving data, Aggregation is used when you need to process data and return computed results (like calculating totals, averages, or grouping data).

The Aggregation Pipeline

MongoDB uses a pipeline concept for aggregation. Imagine an assembly line in a factory. The data passes through various "stages." Each stage transforms the data and passes it to the next stage until the final result is produced.

Common Aggregation Stages:

$match: Filters the documents (similar to the WHERE clause in SQL).
$group: Groups documents together based on a specific key (similar to GROUP BY in SQL).
$sort: Sorts the resulting documents.
$project: Selects specific fields to pass to the next stage.

Example of Aggregation Pipeline

Suppose we want to find the total number of students in each branch, but only for students who are older than 20.

Syntax Example:

db.students.aggregate([
  { $match: { age: { $gt: 20 } } },
  { $group: { _id: "$branch", totalStudents: { $sum: 1 } } }
])

Line-by-line Explanation:

db.students.aggregate([...]): Starts the aggregation pipeline.
{ $match: { age: { $gt: 20 } } }: This is the first stage. It filters out anyone aged 20 or below. ($gt means greater than). Only students older than 20 move to the next stage.
{ $group: { _id: "$branch", totalStudents: { $sum: 1 } } }: This is the second stage. It takes the filtered students, groups them by their branch name (using _id: "$branch"), and counts them by adding 1 for every document ($sum: 1).

In simple terms: An aggregation pipeline is like a factory assembly line. In the first stage ($match) we filtered the students whose age is more than 20. Then the remaining data went to the second stage ($group), where we separated them by branch and counted them. This is used for complex calculations.
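
The two stages can be mirrored in plain JavaScript to show what each one does to the data. This is a conceptual equivalent run on a made-up in-memory array, not MongoDB's implementation.

```javascript
const studentRecords = [
  { name: "Rahul", age: 21, branch: "Computer" },
  { name: "Amit", age: 22, branch: "IT" },
  { name: "Priya", age: 20, branch: "Mechanical" },
  { name: "Sneha", age: 23, branch: "IT" },
];

// Stage 1 ($match): keep only students older than 20.
const matched = studentRecords.filter((s) => s.age > 20);

// Stage 2 ($group): count documents per branch.
const counts = new Map();
for (const s of matched) {
  counts.set(s.branch, (counts.get(s.branch) ?? 0) + 1);
}
const grouped = [...counts].map(([branch, totalStudents]) => ({ _id: branch, totalStudents }));

console.log(grouped);
```

Priya (age 20) is dropped in the first stage, so the Mechanical branch never reaches the grouping stage and does not appear in the result.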

Summary and Key Takeaways

NoSQL Databases provide a flexible, scalable, and high-performance alternative to traditional SQL databases, perfect for handling modern Big Data.
There are four main NoSQL Data Models: Key-Value, Document, Column-Family, and Graph.
The CAP Theorem dictates that a distributed system can only provide two out of three guarantees: Consistency, Availability, and Partition Tolerance.
BASE Properties (Basically Available, Soft State, Eventual Consistency) prioritize system availability and flexibility over the strict rules of ACID.
MongoDB is a highly popular Document database that stores JSON-like documents in a binary format called BSON.
CRUD Operations in MongoDB (insert, find, update, delete) are simple to execute and do not require predefined schemas.
Indexing drastically improves search performance, while Aggregation Pipelines allow for complex data processing and summarization directly within the database.


Unit 5 – NoSQL Database | DBMS SPPU 2024 Pattern
