# Cassandra

## Terms & Definitions

### Infrastructure terminology

| Term | Definition | AKA |
| --- | --- | --- |
| Node | Logical atomic unit of processing | Database server instance |
| Rack | A set of Nodes | Load-balanced database server instances |
| Data Center | A set of Racks | Highly available load-balanced database server instances |
| Cluster | A set of Data Centers | Globally available load-balanced database server instances |
### Database terminology

| Term | Definition | AKA |
| --- | --- | --- |
| Keyspace | Contains data related to a single application | Schema |
| Table | Contains data related to a single object (originally known as Column Family) | Database table/Collection |
| Row | Contains data for a single object instance, with a structure equal to the Column Family the Row is under | Row (duh?) |
| Primary Key | Composed of the Partition Key and Clustering Keys | Nothing like this |
| Partition Key | Non-unique identifier for a Row | Primary key |
| Clustering Key | Identifier for Rows within a Partition | A `GROUP BY`-ish key |
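A minimal sketch of how these terms map to CQL (the `sensor_readings` table and its columns are hypothetical):

```sql
-- The Partition Key (sensor_id) decides which Node stores the Row;
-- the Clustering Key (reading_ts) identifies and orders Rows within
-- the Partition; together they form the Primary Key.
CREATE TABLE sensor_readings (
    sensor_id  uuid,
    reading_ts timestamp,
    value      double,
    PRIMARY KEY ((sensor_id), reading_ts)
);
```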
### Operational terminology

| Term | Definition | AKA |
| --- | --- | --- |
| Bloom Filter | Reduces the seek time for read operations | - |
| CQL | Cassandra Query Language | (Yet another) domain-specific SQL |
| Gossip | Protocol used to achieve eventual consistency | The syncing protocol |
| Partition Key Cache | Stores the location of commonly queried partitions. Enabled by default | - |
| Row Cache | A cache reserved for frequently queried data rows. Optional feature | - |
| Seed Node | A node that bootstraps other nodes | Master database instance |
## Usage

**Consistency Check**

Before querying, Cassandra checks whether replica consistency is satisfied. If not, an exception is thrown.

**Read Path**

Reading is more expensive than writing because of the Consistency Check and Read Repair.

**Read Repair**

When replicas disagree during a read, Cassandra writes the most recent version of the data back to the stale replicas.

**Tombstone**

Cassandra does not delete data immediately; it tags a Row for deletion, and this marker is called a Tombstone.
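A hedged sketch using the hypothetical `sensor_readings` table from the terminology section: the `DELETE` below writes a Tombstone rather than removing the Row, and the Tombstone is physically purged only during compaction, after the table's `gc_grace_seconds` window has elapsed.

```sql
-- Marks the Row as deleted (writes a Tombstone); the data stays on disk
-- until compaction runs after gc_grace_seconds.
DELETE FROM sensor_readings
 WHERE sensor_id = 123e4567-e89b-12d3-a456-426614174000
   AND reading_ts = '2024-01-01 00:00:00+0000';
```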
## Availability Scaling

**Consistency**

- Cassandra is eventually consistent; the required level of agreement is chosen per request (see the sketch after this list):
  - `ONE`, `TWO`, `THREE` - the specified number of replicas must agree
  - `QUORUM` - a majority of replicas must agree
  - `LOCAL_` (modifier) - `LOCAL` refers to the Data Center
    - `LOCAL_TWO` - two replicas within the Data Center must agree
    - `LOCAL_QUORUM` - a majority of replicas within the Data Center must agree
  - `EACH_` (modifier) - `EACH` refers to the Data Center
    - `EACH_TWO` - all Data Centers must have two replicas within that Data Center agree
    - `EACH_QUORUM` - all Data Centers must have a majority of replicas within the Data Center agree
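A sketch of choosing a level in practice, in `cqlsh` (the `CONSISTENCY` command is a cqlsh feature; the table is the hypothetical one from above):

```sql
-- Apply LOCAL_QUORUM to subsequent requests: a majority of replicas
-- within the local Data Center must agree on each read/write.
CONSISTENCY LOCAL_QUORUM;
SELECT * FROM sensor_readings
 WHERE sensor_id = 123e4567-e89b-12d3-a456-426614174000;
```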
**Data Partition**

- Made up of Rows that share a common Partition Key

**Data Replication**

- Configured per Keyspace (replication factor per Data Center) - a sketch follows below
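A hedged example of per-Data-Center replication (the keyspace and Data Center names are hypothetical):

```sql
-- Replication is configured per Keyspace: here, three replicas in each
-- of two Data Centers.
CREATE KEYSPACE sensor_data
  WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'dc1': 3,
    'dc2': 3
  };
```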
**Data Centers**

- Simulate real-world data centers
- Hold multiple Racks
- Enable segregation of workloads that require the same data set
- Enable lower latency by processing data as close to the source region as possible
- Enable a disaster-recovery-ready setup

**Node**

- Owns set(s) of Data Partitions
- Responsible for multiple token ranges via VNodes
- Each Node is assigned 256 tokens by default (the `num_tokens` option)
**Partition**

- Ideally less than 10 MB
- Definitely less than 100 MB
- Partition size is a crucial attribute for Cassandra performance
**Partition Key**

- Each Table contains one
- Can be standalone or composite
- Converted to a Token by a partitioner; the default partitioner is `Murmur3Partitioner` (see the sketch below)
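A hedged way to inspect the Token a Partition Key maps to (`token()` is a built-in CQL function; the table is the hypothetical one from above):

```sql
-- token() exposes the Murmur3 token computed from the Partition Key;
-- this token determines which Node owns the Row's Partition.
SELECT sensor_id, token(sensor_id) FROM sensor_readings;
```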
**Replication**

- Defined by the Replication Factor
- One primary replica exists on the Node that owns the associated Token

**Replication Factor**

- Ideally an odd number
- The number of Racks should be a multiple of the Replication Factor
**Replication Strategy**

- Defined at the Keyspace level
- Either `SimpleStrategy` or `NetworkTopologyStrategy`
  - `SimpleStrategy` does not consider Racks/Data Centers; it places data replicas on Nodes sequentially (example below)
  - `NetworkTopologyStrategy` does consider Racks/Data Centers when placing replicas
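For contrast with the `NetworkTopologyStrategy` keyspace above, a sketch of the topology-unaware variant (keyspace name hypothetical):

```sql
-- SimpleStrategy places replicas on Nodes sequentially around the ring,
-- ignoring Racks and Data Centers; typically used for single-DC test setups.
CREATE KEYSPACE sensor_data_dev
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
```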
**Replica Placement Strategy**

- Defines the placement of a Data Partition's replicas on other Nodes
**Snitches**

- Determine the Data Center and Rack a Node belongs to
- Defined at the Node level
- Default is `GossipingPropertyFileSnitch`
- Dynamic snitches exist to keep track of node latencies dynamically
**Token**

- Unique identifier referencing a Data Partition
- Range: -2^63 to +2^63 - 1
- Each Node owns a portion of this range and primarily owns the data corresponding to that portion
- The token range is split as the cluster grows (e.g. with one node, that node owns the entire range; with X nodes, the X nodes share the entire range)
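A sketch of how token ranges surface in CQL (hypothetical table again; `token()` range restrictions are standard CQL, commonly used to scan slices of the ring):

```sql
-- Reads only the Rows whose Partitions fall within one slice of the
-- token range; -9223372036854775808 is -2^63, the minimum token.
SELECT * FROM sensor_readings
 WHERE token(sensor_id) > -9223372036854775808
   AND token(sensor_id) <= 0;
```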
**VNode**

- A virtual node structure that enables splitting a Node's token range into smaller ranges
- Each physical Node is assigned a number of virtual Nodes
## Performance Scaling

**Partition sizes**

- Partition size affects JVM heap size
- Partition size affects GC
  - A big partition makes garbage collection inefficient (GC works over a larger JVM heap)
- Partition size affects Cassandra Repairs

**Querying**

- Touch as few Partition Keys as possible per query - 1 if possible (see the contrast below)
  - Cassandra is more efficient when data is retrieved from a single Partition
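A hedged contrast between the two access patterns (hypothetical table; an `IN` clause over Partition Keys fans the request out across Partitions):

```sql
-- Efficient: a single-Partition read, served by one replica set.
SELECT * FROM sensor_readings
 WHERE sensor_id = 123e4567-e89b-12d3-a456-426614174000;

-- Less efficient: touches multiple Partitions, potentially coordinated
-- across several Nodes.
SELECT * FROM sensor_readings
 WHERE sensor_id IN (123e4567-e89b-12d3-a456-426614174000,
                     223e4567-e89b-12d3-a456-426614174000);
```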
## Error messages

**Data center differs from previous data center**
Run `kubectl edit statefulset cassandra` and add the `-Dcassandra.ignore_dc=true` JVM flag, which skips this startup check.
**Snitch's rack differs from previous rack**

As above, but the corresponding JVM flag is `-Dcassandra.ignore_rack=true`.