How Cassandra organizes data

Welcome to Part 2 of our Apache Cassandra® series. In Part 1, we introduced you to the features of Cassandra, a powerful, distributed NoSQL database trusted by thousands of global enterprises. In this post, we show you how to build advanced data models on Cassandra so you can go on to build successful applications.

Apache Cassandra® is a distributed NoSQL database well known for its big data scale and disaster tolerance capabilities. Offering great features like elastic linear scalability, a high level of performance, and the ability to handle millions of queries while keeping petabytes of data, Cassandra is used by the vast majority of Fortune 100 companies.

But remember: this powerful capacity is a shared responsibility between developers and Cassandra. It’s not enough just to launch a cluster. It’s crucial to develop a proper data model to take advantage of all that Cassandra has to offer. Otherwise, these powerful features will fall short.

In this post, we go beyond queries, partitions, and tables to design an efficient data model for highly loaded applications. We’ll discuss the following:

Figure 1. Cassandra stores data in tables.

Knowing how Cassandra organizes data is essential to building a correct data model so you can deliver successful and efficient applications. The main components of Cassandra’s data structure include:

Let’s look at each of the components in more detail.

A keyspace determines how data is replicated across nodes. The number of copies of each row that Cassandra keeps is known as the replication factor, and it can be set separately for each data center. In Figure 2, ‘3’ and ‘5’ are the replication factors for “DC-West” and “DC-East” respectively. Data replication is an innate feature of Cassandra that ensures reliability and fault tolerance.

When creating or modifying a keyspace, you need to specify a replication strategy that determines the nodes where replicas are placed. There are two kinds:
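As a rough illustration, the replication factors from Figure 2 could be expressed in a CREATE KEYSPACE statement. The sketch below only builds the CQL text in Python and does not talk to a cluster. The keyspace name `sensor_data` and the helper `create_keyspace_cql` are made up for this example, and it assumes the NetworkTopologyStrategy replication class, which accepts one replication factor per data center.

```python
def create_keyspace_cql(keyspace: str, replication: dict[str, int]) -> str:
    """Build a CREATE KEYSPACE statement that uses NetworkTopologyStrategy.

    `replication` maps a data center name to its replication factor.
    """
    # The strategy class comes first, followed by one entry per data center.
    pairs = ["'class': 'NetworkTopologyStrategy'"]
    pairs += [f"'{dc}': {factor}" for dc, factor in replication.items()]
    options = ", ".join(pairs)
    return f"CREATE KEYSPACE IF NOT EXISTS {keyspace} WITH replication = {{{options}}};"


if __name__ == "__main__":
    # Replication factors from Figure 2; the keyspace name is hypothetical.
    print(create_keyspace_cql("sensor_data", {"DC-West": 3, "DC-East": 5}))
```

The resulting statement could then be run through cqlsh or a driver of your choice.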

A partition key specifies which node will hold a particular table row, and clustering columns ensure data uniqueness and establish the sorting order. Together they form the table’s primary key. Note that once you’ve set a primary key for your table, it can’t be changed.
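To make the two roles concrete, here is a toy in-memory model written in Python. It is only a sketch of the idea, not how Cassandra stores data internally: the class `Table` and the sensor and date values are invented for illustration. Rows are grouped by partition key, kept unique per clustering key, and read back in clustering order.

```python
class Table:
    """Toy in-memory model of a Cassandra table (illustration only)."""

    def __init__(self):
        # partition key -> {clustering key -> row data}
        self._partitions = {}

    def upsert(self, partition_key, clustering_key, data):
        partition = self._partitions.setdefault(partition_key, {})
        # The same partition key plus clustering key identifies the same row,
        # so a second write updates that row instead of adding a new one.
        partition.setdefault(clustering_key, {}).update(data)

    def read_partition(self, partition_key):
        partition = self._partitions.get(partition_key, {})
        # Rows come back ordered by clustering key.
        return [(key, partition[key]) for key in sorted(partition)]


if __name__ == "__main__":
    table = Table()
    table.upsert(("sensor-1",), ("2024-01-02",), {"value": 20})
    table.upsert(("sensor-1",), ("2024-01-01",), {"value": 18})
    table.upsert(("sensor-1",), ("2024-01-02",), {"value": 21})  # same row, updated
    for row in table.read_partition(("sensor-1",)):
        print(row)
```

Writing twice with the same primary key leaves a single row, which is why the choice of clustering columns decides what counts as a distinct row.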

There are two kinds of partitions:

Depending on the number of users, you can have as many partitions as you want. It’s a common misconception that a large number of partitions makes your data model inefficient, but even billions of partitions won’t affect Cassandra’s performance. However, there are some limitations on the rows inside a partition. You usually don’t want to have more than a hundred thousand rows in a single partition.

Once you set a partition key for your table, a partitioner hashes each partition key value into a token. Every node is assigned a range of tokens, called a token range, and Cassandra automatically distributes each row of data across the cluster according to its token.
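The token lookup can be sketched in a few lines of Python. This is a simplified model under stated assumptions: it uses MD5 from the standard library as a stand-in hash (Cassandra's default partitioner uses Murmur3), gives each node a single token, and ignores replication. The names `token_for`, `Ring`, and the node labels are hypothetical.

```python
import hashlib
from bisect import bisect_left


def token_for(partition_key: str) -> int:
    """Hash a partition key to a 64-bit token (MD5 stand-in, not Murmur3)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class Ring:
    """Each node owns the token range that ends at its own token."""

    def __init__(self, node_tokens: dict):
        entries = sorted((token, node) for node, token in node_tokens.items())
        self._tokens = [token for token, _ in entries]
        self._nodes = [node for _, node in entries]

    def node_for_token(self, token: int) -> str:
        # The first node whose token is >= the row's token owns the row;
        # tokens past the last node wrap around to the first node.
        index = bisect_left(self._tokens, token) % len(self._tokens)
        return self._nodes[index]

    def node_for(self, partition_key: str) -> str:
        return self.node_for_token(token_for(partition_key))


if __name__ == "__main__":
    ring = Ring({"node-a": 2**62, "node-b": 2**63, "node-c": 3 * 2**62})
    for key in ("sensor-1", "sensor-2", "sensor-3"):
        print(key, "->", ring.node_for(key))
```

Because the owner is derived from the hash alone, any node can work out where a row lives without a central lookup table.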

Figure 4 shows typical sample CQL queries on Cassandra. Everything works here because the queries match the partition keys (K) and the clustering key (C).

Figure 5. Invalid CQL queries.

But in Figure 5, you can see invalid CQL queries. The first two don’t work because they specify only half of the partition key, “venue”, so Cassandra can’t calculate the token. For the query to be valid, you need both “venue” and “year”.

Although the fourth query specifies both “venue” and “year”, it also doesn’t work because “title” is a data column and not part of the primary key. The fifth query has a similar problem: “country” is a static column rather than part of the primary key, so it can’t be used to locate the partition.
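The rules behind these failures can be approximated with a small checker. The Python sketch below is a simplification, assuming equality restrictions only and ignoring secondary indexes and ALLOW FILTERING. The partition key columns “venue” and “year” come from the text; the clustering column name `artifact` and the function `check_where_clause` are hypothetical.

```python
PARTITION_KEY = ("venue", "year")    # named in the text
CLUSTERING_COLUMNS = ("artifact",)   # hypothetical: the text does not name it


def check_where_clause(restricted_columns):
    """Return (is_valid, reason) for the columns restricted in a WHERE clause."""
    restricted = set(restricted_columns)

    # Without every partition key column, the token cannot be computed.
    missing = [column for column in PARTITION_KEY if column not in restricted]
    if missing:
        return False, "missing partition key column(s): " + ", ".join(missing)

    # Data columns and static columns are not part of the primary key.
    key_columns = set(PARTITION_KEY) | set(CLUSTERING_COLUMNS)
    extra = sorted(restricted - key_columns)
    if extra:
        return False, "not part of the primary key: " + ", ".join(extra)

    # Clustering columns must be restricted in declared order, without gaps.
    gap_seen = False
    for column in CLUSTERING_COLUMNS:
        if column not in restricted:
            gap_seen = True
        elif gap_seen:
            return False, "clustering column out of order: " + column
    return True, "ok"


if __name__ == "__main__":
    for columns in (["venue"], ["venue", "year"], ["venue", "year", "title"]):
        print(columns, "->", check_where_clause(columns))
```

A check like this is only a reading aid; Cassandra itself reports the authoritative error when a query is rejected.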

To sum up, there are some important implications to keep in mind when working with data on Cassandra:

When working with queries, consider the following:

These implications ensure that Cassandra can handle petabytes of data and answer your queries within milliseconds, while still being globally available with multiple data centers.

Moving on, there are four objectives of the Cassandra data modeling methodology:

IoT conceptual data model
Figure 7. Entity-relationship diagram for IoT sensor data model.

The first example is sensor network or IoT data, similar to the data you would see for a smart home system. Let’s look at how you would model it on Cassandra.

4. Physical data model: You can create a physical data model directly from a logical data model by analyzing and optimizing for performance. The physical data model defines data types and determines if we need secondary indices or materialized views. The most common type of analysis is identifying potentially large partitions. Some common optimization techniques include splitting and merging partitions, data indexing, data aggregation and concurrent data access optimizations.

In this example, the model optimizes data retrieval by adding a new partition key column, “bucket”. It also limits the growth of a table’s partition size by introducing a new partition key column, “week”, a technique known as bucketing.
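The week-based split can be sketched as a function that derives the extra partition key value from a reading's timestamp. This is a minimal Python sketch, assuming ISO weeks in UTC and a text label for the week; the function names and the sensor identifier are invented for illustration and are not taken from the original model.

```python
from datetime import datetime, timezone


def week_bucket(timestamp: datetime) -> str:
    """Return an ISO year-week label such as '2024-W05' for an aware timestamp."""
    iso = timestamp.astimezone(timezone.utc).isocalendar()
    return f"{iso[0]}-W{iso[1]:02d}"


def partition_key(sensor: str, timestamp: datetime) -> tuple:
    # All readings from one sensor in one week share a partition,
    # so no single partition keeps growing forever.
    return sensor, week_bucket(timestamp)


if __name__ == "__main__":
    reading_time = datetime(2024, 1, 29, 8, 30, tzinfo=timezone.utc)
    print(partition_key("sensor-1", reading_time))
```

A shorter window yields smaller partitions but means a query over a long time span has to read more of them, so the window size is a trade-off to tune against the expected write rate.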
