This lesson provides an overview of the driver token-aware policy, prepared statements, and shard-aware drivers.
Now let’s see how a client driver queries a Cassandra or ScyllaDB cluster, because now that we know how the data is distributed, we can expect client drivers to use this knowledge to optimize their query plan.

A naive client would go about it like this: when it connects to a Cassandra or ScyllaDB cluster, it opens a connection to every node of the cluster. When it wants to issue a query, it picks one of its connections, say at random, and sends the query to that node. From the client’s perspective, the node that receives the query acts as what is called a coordinator: it takes the ephemeral responsibility of routing the query inside the cluster to the right nodes, the replicas responsible for the partition the query belongs to, then gathers their responses and replies to the client.

But maybe this coordinator is not itself a replica for the queried data. If that’s the case, it has to forward the query to all the replicas itself, which adds an extra hop inside the cluster before the responses come back. This is sub-optimal, of course, as it consumes network bandwidth and processing power on the coordinator node for something the client could have figured out in the first place: since the partitioner hash function is known, the client library can use it to predict where data lives in the cluster and optimize query routing.
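To make the idea concrete, here is a toy sketch, not the driver’s actual code: the node names are made up and an MD5-based stand-in replaces the real Murmur3 partitioner, but it shows how hashing a partition key onto a token ring tells a client which node to talk to.

```python
import bisect
import hashlib

# Hypothetical three-node ring: (token, node) pairs sorted by token.
# Real clusters use the Murmur3 partitioner and many tokens per node (vnodes);
# this is only meant to illustrate the principle.
ring = [(-6 * 10**18, "node-a"), (0, "node-b"), (6 * 10**18, "node-c")]
tokens = [token for token, _ in ring]

def pick_coordinator(partition_key: str) -> str:
    # Stand-in hash producing a signed 64-bit token (Murmur3 in real life).
    digest = hashlib.md5(partition_key.encode()).digest()
    token = int.from_bytes(digest[:8], "big", signed=True)
    # The owner is the first node whose token is >= the key's token,
    # wrapping around the ring.
    idx = bisect.bisect_left(tokens, token) % len(ring)
    return ring[idx][1]

print(pick_coordinator("alice"))  # the node a token-aware client would query first
```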
This is exactly what the Cassandra driver does, using its token-aware policy. How does it work? Token-aware clients apply the partitioner logic to select the right connection to the right node, making sure that the coordinator node is also a replica of the queried data. This is very efficient: we save network hops, lower the cluster’s internal load, and reduce query latency, meaning faster queries. Let’s see how the Cassandra driver does it internally.
From the point of view of the Python Cassandra driver, the partition key is seen as a routing key, because it is used to route the query: it determines which nodes are the replicas for that query. For the driver to know the partition key of a query, the query must be prepared as a server-side statement. You can see Cassandra’s and ScyllaDB’s prepared statements a bit like stored procedures in the CQL world: you call statement = session.prepare() with the query you want, and wherever you have an argument or parameter you simply put a question mark. This is the recommended and most efficient way to query data, because once a query has been prepared it is validated and lives on the server side; you no longer have to send and parse the full query text on every call, you just pass a reference to it along with the arguments, and one of those arguments, the mandatory partition key, acts as the routing key. In short: statement + routing key = replica nodes.
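Here is a minimal sketch of what this looks like with the Python driver, assuming a locally reachable cluster and a hypothetical keyspace app with a users table keyed by user_id:

```python
from cassandra.cluster import Cluster

# Minimal sketch; cluster address, keyspace and table names are hypothetical.
cluster = Cluster(["127.0.0.1"])
session = cluster.connect("app")

# Prepare once: the statement is parsed and validated server-side, and the
# driver learns which bind parameter is the partition key.
statement = session.prepare("SELECT name FROM users WHERE user_id = ?")

# Bind only the arguments; the driver derives the routing key from them.
bound = statement.bind(("alice",))

# The routing key lets the driver pick a coordinator that is also a replica.
replicas = cluster.metadata.get_replicas("app", bound.routing_key)
print("replicas for this partition:", [host.address for host in replicas])

for row in session.execute(bound):
    print(row.name)
```

Preparing once and re-binding arguments is what lets the driver compute the routing key for every execution without re-sending the CQL text.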
Another thing to note is that, just like stored procedures, prepared statements are also the safest way to query, because they prevent query injections. So in production, at the bare minimum, only use prepared statements when you issue queries to Cassandra or ScyllaDB clusters.
The Python Cassandra driver defaults to the TokenAwarePolicy to route queries, combined with a DCAwareRoundRobinPolicy load-balancing policy. The name is a bit long, but it simply means that the driver will load balance queries for you in a round-robin fashion, one node after the other, within the local data center. It is about the most basic load-balancing algorithm there is, but it is still pretty efficient, so don’t worry if your cluster is not spread across multiple data centers: it still works, it just happens to be the default. So the default is TokenAwarePolicy wrapping DCAwareRoundRobinPolicy.
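If you want to make that default explicit, or set the local data center yourself, you can configure the same policies by hand. This sketch uses the driver’s execution-profile style, with a hypothetical data center name:

```python
from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
from cassandra.policies import TokenAwarePolicy, DCAwareRoundRobinPolicy

# Token awareness wrapped around per-datacenter round-robin; "dc1" is a
# hypothetical data center name, replace it with your own.
profile = ExecutionProfile(
    load_balancing_policy=TokenAwarePolicy(DCAwareRoundRobinPolicy(local_dc="dc1"))
)
cluster = Cluster(["127.0.0.1"],
                  execution_profiles={EXEC_PROFILE_DEFAULT: profile})
session = cluster.connect()
```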
By doing so, query routing will not only hit the right node holding a copy of the data you seek (remember, it’s called a replica), it will also load balance the queries evenly between all the replicas. One could think this is awesome and optimal, and from a Cassandra cluster’s point of view it is: we can’t do better than this. But that’s not true for a ScyllaDB cluster. Remember, ScyllaDB shards the data one level further down, to the node’s CPU cores. Having token awareness is cool, but if our client also had shard awareness it would be even cooler: a token-aware client can be extended to become a shard-aware client and route its queries not only to the right nodes but right to their CPU cores. This is a very interesting thing to do.
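As a pointer (an assumption about packaging, not part of the talk itself): ScyllaDB maintains shard-aware forks of the drivers, and the Python one is published on PyPI as scylla-driver while keeping the same cassandra package name, so the snippets above should work unchanged once it is installed.

```python
# pip install scylla-driver   <- ScyllaDB's shard-aware fork of the Python driver.
# The import path stays the same, so existing code does not need to change.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])  # against ScyllaDB, the fork routes per shard
session = cluster.connect()
print(session.execute("SELECT release_version FROM system.local").one())
```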
*This lesson was written with the help of Alexys Jacob. Alexys, also known to the developer community across social media as @ultrabug, is the CTO at Numberly, an open-source contributor, a Gentoo Linux developer, and a PSF contributing member. The lesson is based on his talk at the EuroPython 2020 conference. Thank you, Alexys!