Collections describe a group of items connected to a single key. If used correctly, they can help simplify data modeling.
Remember to use the appropriate collection per use case.
Keep collections small to prevent high latency when querying the data.
Sets are ordered alphabetically or based on the natural sorting method of the type. Examples: multiple email addresses or phone numbers per user.
Lists are ordered objects based on the user’s definition.
Maps are collections of typed key-value pairs. They are very helpful for logging sequential events.
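As a rough analogy for the ordering behavior described above, the following plain-Python sketch mimics how each collection type hands its elements back. It is not driver or database code: the function names and sample values are hypothetical, and the assumption that map entries come back sorted by key follows the usual CQL behavior rather than anything stated in this lesson.

```python
# Plain-Python analogy for how the three collection types order their elements.
# All names and sample values here are hypothetical.

def read_set(values):
    """A set keeps unique values and returns them in the type's natural sort order."""
    return sorted(set(values))

def read_list(values):
    """A list keeps duplicates and preserves the order the user wrote them in."""
    return list(values)

def read_map(pairs):
    """A map holds key-value pairs; entries are assumed to come back sorted by key."""
    return sorted(dict(pairs).items())

emails = read_set(["zoe@example.com", "amy@example.com", "zoe@example.com"])
tasks = read_list(["wake up", "walk dog", "wake up"])
events = read_map([("2024-01-02", "logout"), ("2024-01-01", "login")])

print(emails)  # duplicates removed, sorted alphabetically
print(tasks)   # insertion order and duplicates preserved
print(events)  # sorted by key, which suits time-keyed event logs
```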
Transcript
Okay,
So I talked about the importance of choosing a partition key
and the clustering key, some things to keep in mind.
This is sort of a summary, so we want to have
a high cardinality, which goes together
with even distribution.
Right? So
I’m going to talk about some examples of partition keys,
but what you saw in the example was
some “uuid” which is usually
random and has high cardinality
and that’s something we strive for.
So we want an even distribution with high cardinality,
and we want to avoid the case where
some of the nodes in the cluster have
more data than others, so they become large partitions.
Think of a case where we don’t have
high cardinality and so don’t get even distribution.
We might have a case where we have
say a cluster with ten nodes,
but because the distribution is not even,
some of the nodes hold the majority of the data
and it’s easy to understand that that’s wasting resources
and it’s not going to achieve great performance.
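To make the ten-node example concrete, here is a minimal Python sketch that hashes partition keys onto nodes and counts the rows per node. It is a simplification under stated assumptions: `node_for` uses MD5 and a modulo as a stand-in for the real partitioner and token ring, and the row counts and key names are invented for illustration.

```python
import hashlib
import uuid
from collections import Counter

NODES = 10  # the ten-node cluster from the example

def node_for(partition_key: str) -> int:
    """Map a partition key to a node by hashing it (simplified stand-in for a token ring)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NODES

def rows_per_node(keys):
    """Count how many rows land on each node."""
    counts = Counter(node_for(k) for k in keys)
    return [counts.get(n, 0) for n in range(NODES)]

# High cardinality: one random UUID per row.
high = rows_per_node(str(uuid.uuid4()) for _ in range(100_000))
# Low cardinality: only three distinct key values for the same number of rows.
low = rows_per_node(f"group-{i % 3}" for i in range(100_000))

print("uuid keys :", high)
print("3 keys    :", low)
```

With random UUIDs every node ends up with roughly a tenth of the rows, while three distinct keys can occupy at most three nodes and leave the rest idle.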
Another thing to consider is hot partitions.
So if we know that in our application
some of the data is going to be accessed
more often than other data,
then we want to take that into account.
We want to avoid a situation where some of the nodes
are serving more requests than others
and we should take that into account when we
define our partition key as well.
So some examples of
typically good partition keys
would be a username or a user id like we saw.
It can be a combination of different columns like user
ID and the time together to make that into one
partition key.
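As a sketch of what combining columns into one partition key does, the Python snippet below hashes the key columns together and counts the resulting partitions. The helper `partition_token`, the user ID, and the dates are all hypothetical, and MD5 is only a placeholder for the real partitioner.

```python
import hashlib

def partition_token(*columns) -> int:
    """Hash all partition key columns together into one token (simplified)."""
    joined = "\x1f".join(str(c) for c in columns)
    digest = hashlib.md5(joined.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

# Hypothetical readings for one user over three days: (user_id, day, value).
rows = [("user-42", f"2024-01-0{d}", v) for d in (1, 2, 3) for v in range(4)]

# Partition key = user_id only: every row of this user shares one partition.
by_user = {partition_token(user) for user, day, value in rows}
# Partition key = (user_id, day): the same rows split into one partition per day.
by_user_and_day = {partition_token(user, day) for user, day, value in rows}

print(len(by_user), len(by_user_and_day))
```

The trade-off in this design is that a query then has to supply both the user ID and the day to address a partition.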
Some examples of partition keys that are
usually not so great are
a state, let’s say we’re having
a table of people in the United States
and one of the columns is state; there are 50 states
in the United States
and choosing the state as the partition key
is probably not a great idea
because some states have way more people than other states.
So California, for example, has
millions of people, while other states have just
a few hundred thousand people.
And then we would get data that’s not evenly distributed.
We would have some nodes that become large nodes.
They would hold most of the data.
And also we might get hot partitions because I’m assuming
that if we have equal access to all people, then
the nodes that
hold the data for, say, California or other large states
would be serving more requests than ones
for smaller states.
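The state example can be put into numbers with a short Python sketch. The row counts below are made up, chosen only to mimic "millions versus a few hundred thousand", and the equal-access assumption is the one stated in the talk.

```python
# Made-up row counts per state; not real population figures.
people_per_state = {
    "CA": 3_000_000,
    "TX": 2_000_000,
    "WY": 300_000,
    "VT": 300_000,
    "AK": 300_000,
}
total = sum(people_per_state.values())

# Partitioning by state gives one partition per state, sized by its row count.
largest_partition = max(people_per_state.values())

# If every person is equally likely to be read, request share follows row share.
request_share = {state: rows / total for state, rows in people_per_state.items()}
hottest = max(request_share, key=request_share.get)

print(f"largest partition: {largest_partition} rows")
print(f"hottest partition: {hottest} with {request_share[hottest]:.0%} of requests")
```

Under these assumptions a single partition is both the largest and the most frequently read, which is the combination of large and hot partitions the talk warns about.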
Same goes for age, which is not an
even distribution in most cases, or your favourite sports team.
For the clustering key, it’s useful to, again,
take into account what queries you’re going to be using
in your application, think about the queries
when you create the data model and also think about
the order.
So according to the queries, are you interested
in the very first
rows in the partition or maybe in the very last ones
like we had in this example, where we were interested in,
say, the last minute of heart rate data for the dog.
So keep that in mind when you select your clustering key
and when you define the clustering order
and that allows for efficient queries using limits.
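To illustrate why clustering order makes limit queries cheap, here is a minimal Python sketch of a single partition whose rows are kept sorted newest first. The `Partition` class, the timestamps, and the heart rate values are hypothetical; a real database stores and reads rows differently, but the idea of reading only the first few rows in stored order is the same.

```python
from bisect import insort

class Partition:
    """Rows of one partition kept sorted by clustering key, newest first."""

    def __init__(self):
        # Store (-timestamp, heart_rate) so an ascending sort puts the newest row first.
        self._rows = []

    def insert(self, timestamp: int, heart_rate: int) -> None:
        """Insert a row at its sorted position, regardless of arrival order."""
        insort(self._rows, (-timestamp, heart_rate))

    def select(self, limit: int):
        """Return the first `limit` rows in stored order, like a query with a limit."""
        return [(-neg_ts, hr) for neg_ts, hr in self._rows[:limit]]

dog = Partition()
for ts, hr in [(100, 80), (160, 95), (130, 88), (190, 102)]:
    dog.insert(ts, hr)

latest = dog.select(limit=2)
print(latest)  # the two most recent readings
```

Because the rows are already stored newest first, fetching the latest readings is a read from the start of the partition rather than a scan and sort of all rows.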