Scylla: four ways to optimize your disk space consumption

We recently had to face free disk space outages on some of our scylla clusters, and we learnt some very interesting things along the way while outlining some improvements that we shared with the ScyllaDB guys.

100% disk space usage?

First of all, I wanted to give a bit of a heads-up about what happened when some of our scylla nodes reached (almost) 100% disk space usage.

Basically they:

  • stopped listening to client requests
  • complained in the logs
  • wouldn’t flush commitlog (expected)
  • aborted their compaction work (which actually gave back a few GB of space)
  • stayed in a stuck / unable-to-stop state (unexpected, this has been reported)

After restarting your scylla server, the first and obvious thing you can try in order to get out of this situation is to run the nodetool clearsnapshot command, which will remove any data snapshots that could be lying around. It’s usually a handy command to reclaim space.
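
For reference, here is what that looks like (a hedged example; listing the snapshots first is optional but handy):

# nodetool listsnapshots
# nodetool clearsnapshot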

Reminder: depending on your compaction strategy, it is usually not advised to allow your data to grow over 50% of disk space...

But that’s only a patch so let’s go down the rabbit hole and look at the optimization options we have.


Optimize your schemas

Schema design and the types you choose for your columns have a huge impact on disk space usage! In our case we had indeed overlooked some of the optimizations that we could have done from the start, and that cost us a lot of wasted disk space. Fortunately it was easy and fast to change.

To illustrate this, I’ll take a sample of 100,000 rows of a simple and naive schema associating readings of 50 integers to a user ID:

Note: all those operations were done using Scylla 3.0.3 on Gentoo Linux.

CREATE TABLE IF NOT EXISTS test.not_optimized
(
uid text,
readings list<int>,
PRIMARY KEY(uid)
) WITH compression = {};
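
As a reference, here is a minimal sketch of how such a sample could be loaded (assuming the Python cassandra-driver; the contact point and value ranges are illustrative):

from cassandra.cluster import Cluster
import random
import uuid

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("test")

insert = session.prepare(
    "INSERT INTO not_optimized (uid, readings) VALUES (?, ?)"
)
for _ in range(100000):
    # 50 integer readings associated to a random user ID
    readings = [random.randint(0, 10000) for _ in range(50)]
    session.execute(insert, (str(uuid.uuid4()), readings))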

Once inserted on disk, this takes about 250MB of disk space:

250M    not_optimized-00cf1500520b11e9ae38000000000004

Now depending on your use case, if those readings are not meant to be updated for example, you could use a frozen list instead, which allows a huge storage optimization:

CREATE TABLE IF NOT EXISTS test.mid_optimized
(
uid text,
readings frozen<list<int>>,
PRIMARY KEY(uid)
) WITH compression = {};

With this frozen list we now consume 54MB of disk space for the same data!

54M     mid_optimized-011bae60520b11e9ae38000000000004

There’s another optimization that we can do since our user IDs are UUIDs. Let’s switch to the uuid type instead of text:

CREATE TABLE IF NOT EXISTS test.optimized
(
uid uuid,
readings frozen<list<int>>,
PRIMARY KEY(uid)
) WITH compression = {};

By switching to uuid, we now consume 50MB of disk space: that’s an 80% reduction in disk space consumption compared to the naive schema, for the same data!

50M     optimized-01f74150520b11e9ae38000000000004

Enable compression

All those examples were not using compression. If your workload latencies allow it, you should probably enable compression on your sstables.

Let’s see its impact on our tables:

ALTER TABLE test.not_optimized WITH compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'};
ALTER TABLE test.mid_optimized WITH compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'};
ALTER TABLE test.optimized WITH compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'};

Then we run nodetool compact test to force a (re)compaction of all the sstables and we get:

63M     not_optimized-00cf1500520b11e9ae38000000000004
28M     mid_optimized-011bae60520b11e9ae38000000000004
24M     optimized-01f74150520b11e9ae38000000000004

Compression is really a great gain here, allowing another 50% reduction in disk space usage on our optimized table!

Switch to the new “mc” sstable format

Since the Scylla 3.0 release you can use the latest “mc” sstable storage format on your scylla clusters. It promises greater efficiency and usually a significantly reduced disk space consumption!

It is not enabled by default: you have to add the enable_sstables_mc_format: true parameter to your scylla.yaml for it to be taken into account.
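
Concretely, that’s a single line in your configuration:

# scylla.yaml
enable_sstables_mc_format: true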

Since it’s backward compatible, you have nothing else to do: new compactions will start using the “mc” storage format and the scylla server will seamlessly read from the old sstables as well.

But in our case of immediate disk space outage, we switched to the new format one node at a time, dropped the data from it and ran a nodetool rebuild to reconstruct the whole node using the new sstable format.

Let’s demonstrate its impact on our test tables: we add the option to the scylla.yaml file, restart scylla-server and run nodetool compact test again:

49M     not_optimized-00cf1500520b11e9ae38000000000004
26M     mid_optimized-011bae60520b11e9ae38000000000004
22M     optimized-01f74150520b11e9ae38000000000004

That’s a pretty cool disk space gain, even more so for the non-optimized version of our schema!

So if you’re in great need of disk space or it is hard for you to change your schemas, switching to the new “mc” sstable format is a simple and efficient way to free up some space without effort.

Consider using secondary indexes

While denormalization is the norm (yep… a legitimate pun) in the NoSQL world, this does not mean we have to duplicate everything all the time. A good example lies in the internals of secondary indexes, if your workload can tolerate their moderate impact on latency.

Secondary indexes on scylla are built on top of Materialized Views, which basically store an up-to-date pointer from your indexed column to your main table’s partition key. That means that secondary index MVs do not duplicate all the columns (and thus the data) of your main table, as you would have to do when denormalizing a table to query by another column: this saves disk space!

This of course comes with a latency drawback, because if your workload is interested in columns other than the partition key of the main table, the coordinator node will actually issue two queries to get all your data (illustrated after this list):

  1. query the secondary index MV to get the pointer to the partition key of the main table
  2. query the main table with the partition key to get the rest of the columns you asked for
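
As a hedged illustration (the users table and its email column are hypothetical), a secondary index and the two-step read it implies look like this:

CREATE TABLE IF NOT EXISTS test.users
(
uid uuid,
email text,
name text,
PRIMARY KEY(uid)
);

CREATE INDEX IF NOT EXISTS ON test.users (email);

-- one read on the index MV to find uid, then one on test.users
SELECT uid, name FROM test.users WHERE email = 'user@example.com';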

This has been an effective trick to avoid duplicating a table and save disk space for some of our workloads!

(not a tip) Move the commitlog to another disk / partition?

This should only be considered as a sort of emergency procedure or for cost efficiency (cheap disk tiering) on non critical clusters.

While this is possible even if the disk is not formatted using XFS, it is not advised to separate the commitlog from the data on modern SSD/NVMe disks but… you technically can do it (as we did) on non-production clusters.

Switching is simple: you just need to change the commitlog_directory parameter in your scylla.yaml file.
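
For example (the mount point is of course hypothetical):

# scylla.yaml
commitlog_directory: /mnt/cheap-disk/commitlog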

py3status v3.17

I’m glad to announce a new (awaited) release of py3status featuring support for the sway window manager which allows py3status to enter the wayland environment!

Updated configuration and custom modules paths detection

The configuration section of the documentation explains the updated detection of the py3status configuration file (with respect to the XDG_CONFIG environment variables):

  • ~/.config/py3status/config
  • ~/.config/i3status/config
  • ~/.config/i3/i3status.conf
  • ~/.i3status.conf
  • ~/.i3/i3status.conf
  • /etc/xdg/i3status/config
  • /etc/i3status.conf

Regarding custom module path detection, py3status searches the following locations, as described in the documentation:

  • ~/.config/py3status/modules
  • ~/.config/i3status/py3status
  • ~/.config/i3/py3status
  • ~/.i3/py3status

Highlights

Lots of module improvements and clean-ups, see the changelog.

  • we worked on the documentation sections and content, which allowed us to fix a bunch of typos
  • our magic @lasers has worked a lot on harmonizing thresholds across modules, along with a lot of code clean-ups
  • new module: scroll to scroll modules on your bar (#1748)
  • @lasers has worked a lot on a more granular pango support for modules output (still work to do as it breaks some composites)

Thanks contributors

  • Ajeet D’Souza
  • @boucman
  • Cody Hiar
  • @cyriunx
  • @duffydack
  • @lasers
  • Maxim Baz
  • Thiago Kenji Okada
  • Yaroslav Dronskii

Bye bye Google Analytics

A few days ago, I removed Google Analytics from my blog and trashed the associated account.

I’ve been part of the Marketing Tech and Advertising Tech industries for over 15 years. I design and operate data processing platforms (including web navigation trackers) for a living. So I thought that sharing the reasons why I took this decision might be of interest to some people. I’ll keep it short.

MY convenience is not a good enough reason to send YOUR data to Google

The first and obvious question I asked myself is: why did I (and so many people) set up this tracker on my web site?

My initial answer was a mix of:

  • convenience: it’s easy to set up, there’s a nice interface, you get a lot of details, you don’t have to ask yourself how it’s done, it just works
  • insight: it sounded somewhat important to know who was visiting what content, and to somehow know about the interests of the people visiting

Along with (hopefully not too much of):

  • pride: are some blog posts popular? If so, which ones? Let’s try to write more like them!

I don’t think those are good enough reasons to add a tracker that sends YOUR data to Google.

Convenience kills diversity

I’m old enough to have witnessed the rise of the internet and its availability to (almost) everyone. The first thing I did when I could connect was to create and host my own web site; it was great and it looked ugly!

But while the internet could have been a catalyst for diversity, it turned out to create an over-concentration of services and tools that we think are hard to live without because of their convenience (and a human tendency for mimicry).

When your choices are reduced and mass adoption defines your standards, it’s easy to let go and pretend you don’t care that much.

I decided to stop pretending that I don’t care. I don’t want to participate in the concentration of web navigation tracking to Google.

Open Internet is at risk

When diversity is endangered, so is the Open Internet. The idea that a rich ecosystem can bring its own value and be free to grow by using the data it generates or collects is threatened by the GAFA, who are building walled gardens around OUR data.

For instance, Google used the GDPR regulation as an excuse to close down access to the data collected by their (so convenient) services. If a company (or you) wants to access / query this data (YOUR data), it can only do so by using their billed tools.

What should have been a clear win for us people turned out to also benefit those giants and threaten diversity and the Open Internet even more.

Adding Google Analytics to your web site helps Google get a larger reach and tracking footprint on the whole web: imagine all those millions of web site visits added together, that’s where the value is for them. No wonder GA is free.

So in this regard too, I decided to stop contributing to the empowerment of Google.

This blog is Tracking Free

So from now on, if you want to share your thoughts or just let me know you enjoyed a post on this blog, take the lead on YOUR data and use the comment box.

The choice is yours!

py3status v3.16

Two py3status versions in less than a month? That’s the holiday season effect, but not only that!

Our community has been busy discussing our way forward to 4.0 (see below) and organization so it was time I wrote a bit about that.

Community

A new collaborator

First of all we have the great pleasure and honor to welcome Maxim Baz @maximbaz as a new collaborator on the project!

His engagement, numerous contributions and insightful reviews on py3status have made him a well-known community member, not to mention his IRC support 🙂

Once again, thank you for being there Maxim!

Zen of py3status

As a result of an interesting discussion, we worked on defining more clearly how to contribute to py3status, as well as a set of guidelines we agree on to keep the project moving forward smoothly.

Thus was born the zen of py3status, which extends the philosophy from the user’s point of view to the contributor’s point of view!

This allowed us to handle the numerous open pull requests and get their number down to 5 at the time of writing this post!

Even our dear @lasers doesn’t have any open PRs anymore 🙂

3.15 + 3.16 versions

Our magic @lasers has worked a lot on general module options, as well as adding support for i3-gaps features such as border coloring and fine tuning.

Also interesting is the work of Thiago Kenji Okada @m45t3r around NixOS packaging of py3status. Thanks a lot for this work and for sharing Thiago!

I also liked the question from Andreas Lundblad @aioobe, asking if we could have a feature to display a custom graphical output (such as a small PNG or anything else) upon clicking on the i3bar; you might be interested in following the i3 issue he opened.

Make sure to read the amazing changelog for details, a lot of modules have been enhanced!

Highlights

  • You can now set background and border colors, and their urgent counterparts, on a global scale or per module
  • CI now checks the black format on modules, so the whole code base now obeys the black format style!
  • All HTTP request based modules now have a standard way to define the HTTP timeout, as well as the same 10 seconds default timeout
  • py3-cmd now allows sending click events with modifiers
  • The py3status -n / --interval command line argument has been removed as it was obsolete. We will ignore it if you have set it, but better remove it to be clean
  • You can specify your own i3status binary path using the new -u, --i3status command line argument, thanks to @Dettorer and @lasers
  • Since Yahoo! decided to retire its public & free weather API, the weather_yahoo module has been removed

New modules

  • new conky module: display conky system monitoring (#1664), by lasers
  • new module emerge_status: display information about running gentoo emerge (#1275), by AnwariasEu
  • new module hueshift: change your screen color temperature (#1142), by lasers
  • new module mega_sync: to check for MEGA service synchronization (#1458), by Maxim Baz
  • new module speedtest: to check your internet bandwidth (#1435), by cyrinux
  • new module usbguard: control usbguard from your bar (#1376), by cyrinux
  • new module velib_metropole: display velib metropole stations and (e)bikes (#1515), by cyrinux

A word on 4.0

Do you wonder what’s gonna be in the 4.0 release?
Do you have ideas that you’d like to share?
Do you have dreams that you’d love to become true?

Then make sure to read and participate in the open RFC on 4.0 version!

Development has not started yet; we really want to hear from you.

Thank you contributors!

There would be no py3status release without our amazing contributors, so thank you guys!

  • AnwariasEu
  • cyrinux
  • Dettorer
  • ecks
  • flyingapfopenguin
  • girst
  • Jack Doan
  • justin j lin
  • Keith Hughitt
  • L0ric0
  • lasers
  • Maxim Baz
  • oceyral
  • Simon Legner
  • sridhars
  • Thiago Kenji Okada
  • Thomas F. Duellmann
  • Till Backhaus

Scylla Summit 2018 write-up

It’s been almost one month since I had the chance to attend and speak at Scylla Summit 2018 so I’m relieved to finally publish a short write-up on the key things I wanted to share about this wonderful event!

Make Scylla boring

This statement by Glauber Costa sums up what looked to me like the main driver of the engineering efforts put into Scylla lately: making it work so consistently well on any kind of workload that it’s boring to operate 🙂

I will follow up on this statement to highlight the things I heard and (hopefully) understood during the summit. I hope you’ll find it insightful.

Reduced operational efforts

The thread-per-core and queues design still has a lot of possibilities to be leveraged.

The recent addition of RPC streaming capabilities to seastar allows a drastic reduction in the time it takes the cluster to grow or shrink (data rebalancing / resynchronization).

Incremental compaction is also very promising as this background process is one of the most expensive there is in the database’s design.

I was happy to hear that scylla-manager will soon be made available and free to use with basic features, while more advanced ones (like backup/restore) will be retained for the enterprise version.
I also noticed that the current version does not support storing its configuration on SSL-enabled clusters. So I directly asked Michał about it and I’m glad that it will be released in version 1.3.1.

Performant multi-tenancy

Why choose between real-time OLTP & analytics OLAP workloads?

The goal here is to be able to run both on the same cluster by giving users the ability to assign “SLA” shares to ROLES. That’s basically like pools on Hadoop, but at a much finer grain since it will create dedicated queues that are weighted by their share.

Having one queue per usage, combined with full accounting, will make it possible to limit resources efficiently and let users have their say on their latency SLAs.

But Scylla also has a lot to do in the background to run smoothly. So while this design pattern was already applied to temper compactions, a lot of work has also been done on automatic flow control and back pressure.

For instance, Materialized Views are updated asynchronously, which means that while we can interact with and put a lot of pressure on the table they are based on (called the Main Table), we could overwhelm the background work that’s needed to keep the MV View Tables in sync. To mitigate this, a smart back pressure approach was developed that will throttle the clients to make sure that Scylla can manage to do everything at the best performance the hardware allows!

I was happy to hear that work on tiered storage is also planned to better optimize disk space costs for certain workloads.

Last but not least, columnar storage optimized for time series and analytics workloads is also something the developers are looking at.

Latency is expensive

If you care about latency, you might be happy to hear that a new polling API (named IOCB_CMD_POLL) has been contributed by Christoph Hellwig and Avi Kivity to the 4.19 Linux kernel; it avoids context switches on I/O by using a ring shared between kernel and userspace. Scylla will use it by default if the kernel supports it.

The iotune utility has been upgraded since 2.3 to generate an enhanced I/O configuration.

Also, persistent (disk backed) in-memory tables are getting ready and are very promising for latency sensitive workloads!

A word on drivers

ScyllaDB has been relying on the Datastax drivers since the start. While that’s a good thing for the whole community, it’s important to note that the shard-per-CPU approach to data placement that Scylla uses is neither known nor leveraged by the current drivers.

Discussions took place, and it seems that Datastax will not allow the protocol to evolve so that drivers could discover whether the connected cluster is shard aware, and then use this information to be more clever about which write/read path to use.

So for now ScyllaDB has been forking and developing their shard aware drivers for Java and Go (no Python yet… I was disappointed).

Kubernetes & containers

The ScyllaDB guys of course couldn’t avoid the Kubernetes frenzy, so Moreno Garcia gave a lot of feedback and tips on how to operate Scylla on Docker with minimal performance degradation.

Kubernetes has been designed for stateless applications, not stateful ones, and Docker does some automatic magic that has rather big performance hits on Scylla. You will basically have to play with affinities to dedicate one Scylla instance to run on one server, with a “retain” reclaim policy.

Remember that the official Scylla docker image runs with dev-mode enabled by default which turns off all performance checks on start. So start by disabling that and look at all the tips and literature that Moreno has put online!

Scylla 3.0

A lot has been written about it already, so I will just be short on the things that are important to understand from my point of view.

  • Materialized Views do backfill the whole data set
    • this job is done by the view building process
    • you can watch its progress in the system_distributed.view_build_status table (see the sketch after this list)
  • Secondary Indexes are Materialized Views under the hood
    • it’s like a reverse pointer to the primary key of the Main Table
    • so if you read the whole row by selecting on the indexed column, two reads will be issued under the hood: one on the index MV view table to get the primary key and one on the main table to get the rest of the columns
    • so if your workload is mostly interested in the whole row, you’re better off creating a complete MV to read from than using an SI
    • this is even more true if you plan to do range scans, as this double query could lead you to read from multiple nodes instead of one
  • Range scans are way more performant
    • ALLOW FILTERING finally allows great flexibility by providing server-side filtering!
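
As a hedged illustration of those points (everything but the system table is hypothetical):

-- watch the Materialized View building progress
SELECT * FROM system_distributed.view_build_status;

-- server-side filtering on a regular column
SELECT * FROM test.users WHERE name = 'Alice' ALLOW FILTERING;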

Random notes

Support for LWT (lightweight transactions) will rely on a future implementation of the Raft consensus algorithm inside Scylla. This work will also benefit Materialized Views consistency. Duarte Nunes will be the one working on this and I envy him very much!

Support for search workloads is high on the ScyllaDB devs’ priority list, so we should definitely hear about it in the coming months.

Support for “mc” sstables (new generation format) is done and will reduce storage requirements thanks to metadata / data compression. Migration will be transparent because Scylla can read previous formats as well so it will upgrade your sstables as it compacts them.

ScyllaDB developers have not settled on how to best implement CDC. I hope they do rather soon because it is crucial to their ability to integrate well with Kafka!

Materialized Views, Secondary Indexes and filtering will benefit from the work on partition key and index intersections to avoid server-side filtering on the coordinator. That’s an important optimization to come!

Last but not least, I’ve had the pleasure to discuss with Takuya Asada who is the packager of Scylla for RedHat/CentOS & Debian/Ubuntu. We discussed Gentoo Linux packaging requirements as well as the recent and promising work on a relocatable package. We will collaborate more closely in the future!

py3status v3.14

I’m happy to announce this release as it contains some very interesting developments in the project. This release was focused on core changes.

IMPORTANT notice

There are now two optional dependencies to py3status:

  • gevent
    • will monkey patch the code to make it concurrent
    • the main benefit is to use an asynchronous loop instead of threads
  • pyudev
    • will enable a udev monitor if a module asks for it (only xrandr so far)
    • the benefit is described below

To install them all using pip, simply do:

pip install py3status[all]

Modules can now react/refresh on udev events

When pyudev is available, py3status will allow modules to subscribe and react to udev events!

The xrandr module uses this feature by default, which allows it to refresh instantly when you plug in or unplug a secondary monitor. It also means the xrandr command no longer has to run in the background, which saves a lot of CPU!

Highlights

  • py3status core uses black formatter
  • fix default i3status.conf detection
    • add ~/.config/i3 as a default config directory, closes #1548
    • add .config/i3/py3status in default user modules include directories
  • add markup (pango) support for modules (#1408), by @MikaYuoadas
  • py3: notify_user module name in the title (#1556), by @lasers
  • print module information to stdout instead of stderr (#1565), by @robertnf
  • battery_level module: default to using sys instead of acpi (#1562), by @eddie-dunn
  • imap module: fix output formatting issue (#1559), by @girst

Thank you contributors!

  • eddie-dunn
  • girst
  • MikaYuoadas
  • robertnf
  • lasers
  • maximbaz
  • tobes

py3status v3.13

I am once again lagging behind on the release blog posts, but this one is an important one.

I’m proud to announce that our long time contributor @lasers has become an official collaborator of the py3status project!

Dear @lasers, your amazing energy and overwhelming ideas have served our little community for a while. I’m sure we’ll have a great way forward as we learn to work together with @tobes 🙂 Thank you again very much for everything you do!

This release is as much dedicated to you as it is yours 🙂

IMPORTANT notice

After this release, the py3status coding style CI will enforce the 'black' formatter style.

Highlights

Needless to say, the changelog is huge; as usual, here is a very condensed view:

  • documentation updates, especially on the formatter (thanks @L0ric0)
  • py3 storage: use $XDG_CACHE_HOME or ~/.cache
  • formatter: multiple variable and feature fixes and enhancements
  • better config parser
  • new modules: lm_sensors, loadavg, mail, nvidia_smi, sql, timewarrior, wanda_the_fish

Thank you contributors!

  • lasers
  • tobes
  • maximbaz
  • cyrinux
  • Lorenz Steinert @L0ric0
  • wojtex
  • horgix
  • su8
  • Maikel Punie

Authenticating and connecting to a SSL enabled Scylla cluster using Spark 2

This quick article is a wrap up for reference on how to connect to ScyllaDB using Spark 2 when authentication and SSL are enforced for the clients on the Scylla cluster.

We encountered multiple problems, all the more since we distribute our workload using a YARN cluster, meaning that our worker nodes must have everything they need to connect properly to Scylla.

We found very little help online, so I hope this will serve anyone facing similar issues (that’s also why I copy/pasted the errors here).

The authentication part is straightforward by itself and was not the source of our problems; SSL on the client side was.

Environment

  • (py)spark: 2.1.0.cloudera2
  • spark-cassandra-connector: datastax:spark-cassandra-connector:2.0.1-s_2.11
  • python: 3.5.5
  • java: 1.8.0_144
  • scylladb: 2.1.5

SSL cipher setup

The Datastax spark cassandra driver uses the TLS_RSA_WITH_AES_256_CBC_SHA cipher by default, which the JVM does not support out of the box. This raises the following error when connecting to Scylla:

18/07/18 13:13:41 WARN channel.ChannelInitializer: Failed to initialize a channel. Closing: [id: 0x8d6f78a7]
java.lang.IllegalArgumentException: Cannot support TLS_RSA_WITH_AES_256_CBC_SHA with currently installed providers

According to the ssl documentation we have two ciphers available:

  1. TLS_RSA_WITH_AES_256_CBC_SHA
  2. TLS_RSA_WITH_AES_128_CBC_SHA

We can get rid of the error by lowering the cipher to TLS_RSA_WITH_AES_128_CBC_SHA using the following configuration:

.config("spark.cassandra.connection.ssl.enabledAlgorithms", "TLS_RSA_WITH_AES_128_CBC_SHA")\

However, this is not really a good solution; instead we’d rather use the TLS_RSA_WITH_AES_256_CBC_SHA version. For this we need to follow Datastax’s procedure.

Then we need to deploy the JCE security jars on all our client nodes; if using YARN like us, this means that you have to deploy these jars to all your NodeManager nodes.

For example by hand:

# unzip jce_policy-8.zip
# cp UnlimitedJCEPolicyJDK8/*.jar /opt/oracle-jdk-bin-1.8.0.144/jre/lib/security/

Java trust store

When connecting, the clients need to be able to validate the Scylla cluster’s self-signed CA. This is done by setting up a trustStore JKS file and providing it to the spark connector configuration (note that you should protect this file with a password).

keyStore vs trustStore

In an SSL handshake, the purpose of the trustStore is to verify credentials while the purpose of the keyStore is to provide them. In Java, the keyStore holds the private key and the certificates corresponding to the public keys; it is required if you are an SSL server or if the server requires client authentication. The trustStore holds certificates from third parties or your own self-signed certificates; your application identifies and validates them using this trustStore.

The spark-cassandra-connector documentation has two options to handle keyStore and trustStore.

When we did not use the trustStore option, we would get some obscure error when connecting to Scylla:

com.datastax.driver.core.exceptions.TransportException: [node/1.1.1.1:9042] Channel has been closed

When enabling DEBUG logging, we got a clearer error which indicated a failure in validating the SSL certificate provided by the Scylla server node:

Caused by: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target

setting up the trustStore JKS

You need to have the self-signed CA public certificate file, then issue the following command:

# keytool -importcert -file /usr/local/share/ca-certificates/MY_SELF_SIGNED_CA.crt -keystore COMPANY_TRUSTSTORE.jks -noprompt
Enter keystore password:  
Re-enter new password: 
Certificate was added to keystore
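
You can then verify the content of the trustStore (a hedged example using keytool’s -list option):

# keytool -list -keystore COMPANY_TRUSTSTORE.jks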

using the trustStore

Now you need to configure spark to use the trustStore like this:

.config("spark.cassandra.connection.ssl.trustStore.password", "PASSWORD")\
.config("spark.cassandra.connection.ssl.trustStore.path", "COMPANY_TRUSTSTORE.jks")\

Spark SSL configuration example

This wraps up the SSL connection configuration used for spark.

This example uses pyspark2 and reads a table in Scylla from a YARN cluster:

$ pyspark2 --packages datastax:spark-cassandra-connector:2.0.1-s_2.11 --files COMPANY_TRUSTSTORE.jks

>>> spark = SparkSession.builder.appName("scylla_app")\
.config("spark.cassandra.auth.password", "test")\
.config("spark.cassandra.auth.username", "test")\
.config("spark.cassandra.connection.host", "node1,node2,node3")\
.config("spark.cassandra.connection.ssl.clientAuth.enabled", True)\
.config("spark.cassandra.connection.ssl.enabled", True)\
.config("spark.cassandra.connection.ssl.trustStore.password", "PASSWORD")\
.config("spark.cassandra.connection.ssl.trustStore.path", "COMPANY_TRUSTSTORE.jks")\
.config("spark.cassandra.input.split.size_in_mb", 1)\
.config("spark.yarn.queue", "scylla_queue").getOrCreate()

>>> df = spark.read.format("org.apache.spark.sql.cassandra").options(table="my_table", keyspace="test").load()
>>> df.show()

A botspot story

I felt like sharing a recent story that allowed us to identify a bot in a haystack, thanks to Scylla.


The scenario

While working on loading 2B+ rows into Scylla from Hive (using Spark), we noticed a strange behaviour in the performance of one of our nodes:


So we started wondering why the server in blue was showing those load peaks and clearly diverging from the other two… As we obviously expect the three nodes to behave the same, there were two options on the table:

  1. hardware problem on the node
  2. bad data distribution (bad schema design? consistent hash problem?)

We shared this with our pals from ScyllaDB and started working on finding out what was going on.

The investigation

Hardware?

A hardware problem was pretty quickly ruled out: nothing showed up in the monitoring nor in the kernel logs, and the I/O queues and throughput were good:

Data distribution?

Avi Kivity (ScyllaDB’s CTO) quickly got the feeling that something was wrong with the data distribution and that we could be facing a hotspot situation. He nailed it down to shard 44 thanks to the scylla-grafana-monitoring platform.

Data is distributed between shards that are stored on nodes (consistent hash ring). This distribution is done by hashing the primary key of your data which dictates the shard it belongs to (and thus the node(s) where the shard is stored).

If one of your keys is over represented in your original data set, then the shard it belongs to can be overly populated and the related node overloaded. This is called a hotspot situation.
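
A toy Python sketch of the phenomenon (the hashing and shard count are illustrative, not Scylla’s actual token ring):

import hashlib
from collections import Counter

def shard_of(key, shards=14):
    # toy placement: hash the partition key and map it to a shard
    return int(hashlib.sha256(key.encode()).hexdigest(), 16) % shards

# an evenly distributed key space plus one over-represented key
keys = ["user-%d" % (i % 1000) for i in range(100000)] + ["bot-key"] * 1000000
load = Counter(shard_of(k) for k in keys)
print(load.most_common(3))  # the shard holding "bot-key" dominates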

tracing queries

The first step was to trace queries in Scylla to try to dig deeper into the hotspot analysis. So we enabled tracing, using the following formula to get about 1 trace per second in the system_traces keyspace:

tracing probability = 1 / expected requests per second throughput

In our case, we were doing between 90K req/s and 150K req/s, so we settled on 100K req/s to be safe (1 / 100,000 = 0.00001) and enabled tracing on our nodes like this:

# nodetool settraceprobability 0.00001

Turns out tracing didn’t help very much in our case because the traces do not include the query parameters in Scylla 2.1; they become available in the soon-to-be-released 2.2 version.

NOTE: traces expire on the tables, so make sure you TRUNCATE the events and sessions tables while iterating. Otherwise you will have to wait for the next gc_grace_period (10 days by default) before they are actually removed. If you don’t do that and generate millions of traces like we did, querying those tables will likely time out because of the “tombstoned” rows, even if no traces remain inside.
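
As a hedged example, cleaning up the trace tables between iterations looks like this:

TRUNCATE system_traces.events;
TRUNCATE system_traces.sessions;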

looking at cfhistograms

Glauber Costa was also helping on the case and got us looking at the cfhistograms of the tables we were pushing data to. That clearly highlighted a hotspot problem:

histograms
Percentile  SSTables     Write Latency      Read Latency    Partition Size        Cell Count
                             (micros)          (micros)           (bytes)                  
50%             0,00              6,00              0,00               258                 2
75%             0,00              6,00              0,00               535                 5
95%             0,00              8,00              0,00              1916                24
98%             0,00             11,72              0,00              3311                50
99%             0,00             28,46              0,00              5722                72
Min             0,00              2,00              0,00               104                 0
Max             0,00          45359,00              0,00          14530764            182785

What this basically means is that the 99th percentile of our partitions is small (~5KB) while the biggest is 14MB! That’s a huge difference, and it clearly shows that we have a hotspot on a partition somewhere.

So now we know for sure that we have an over represented key in our data set, but what key is it and why?

The culprit

So we looked at the cardinality of our data set keys, which are SHA256 hashes, and found out that we indeed had one with more than 1M occurrences while the second highest one was around 100K!…
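
As a hedged illustration using Spark (the DataFrame and its sha256 column are hypothetical), such a cardinality check can be as simple as:

>>> df.groupBy("sha256").count().orderBy("count", ascending=False).show(2)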

Now that we had the main culprit hash, we turned to our data streaming pipeline to figure out what kind of event was generating the data associated to the given SHA256 hash… and surprise! It was a client’s quality assurance bot that was constantly browsing their own website with legitimate behaviour and identity credentials associated to it.

So we modified our pipeline to detect this bot and discard its events so that it stops polluting our databases with fake data. Then we cleaned up the million events’ worth of mess and traces we had stored about the bot.

The aftermath

Finally, we cleared out the data in Scylla and tried again from scratch. Needless to say, the curves got way better and are exactly what we should expect from a well-balanced cluster:

Thanks a lot to the ScyllaDB team for their thorough help and high spirited support!

I’ll quote them to conclude this quick blog post: