Monday, July 29, 2019

Some Cassandra Interview Questions

Here is a list of some important Cassandra interview questions that you can expect in an interview.

1) Repairs in Cassandra

Scenario 1: Automatic (blocking) read repair performed internally during reads
Scenario 2: Manual anti-entropy repair, via nodetool repair
Scenario 3: Background read repair
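The reconciliation step behind read repair can be sketched in plain Python (a toy model, not Cassandra's actual implementation): the coordinator collects (value, timestamp) pairs from the replicas, the newest timestamp wins, and replicas holding stale data are rewritten with the winning value.

```python
# Toy sketch of read-repair reconciliation (not Cassandra's real code).
def read_repair(replicas):
    """replicas: dict of node -> (value, timestamp).
    Returns (winning value, list of stale nodes to repair)."""
    winner_node = max(replicas, key=lambda n: replicas[n][1])  # newest write wins
    winner = replicas[winner_node]
    stale = [n for n, cell in replicas.items() if cell != winner]
    return winner[0], stale

value, stale_nodes = read_repair({
    "node1": ("alice", 100),
    "node2": ("alice-updated", 200),   # most recent write
    "node3": ("alice", 100),
})
print(value, stale_nodes)  # alice-updated ['node1', 'node3']
```

In the real system the coordinator then writes the winning cell back to the stale replicas, so subsequent reads are consistent.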

2) Reads Path

Scenario 1: Application to Cassandra. Depending on the driver's load-balancing policy (DCAwareRoundRobinPolicy, TokenAwarePolicy, etc.), the query goes to one node in the cluster, called the coordinator node, which is responsible for returning the result and acknowledgement to the client.
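How a token-aware policy picks the coordinator can be sketched with a toy token ring (node names and token values below are made up; real drivers hash the partition key with Murmur3 and handle vnodes and replication):

```python
from bisect import bisect_left

# Toy ring: each node owns the range ending at its token, sorted by token.
RING = [(100, "node1"), (200, "node2"), (300, "node3")]

def owner(token):
    """Return the node owning `token`: the first node whose token is >= the
    key's token, wrapping around the ring."""
    tokens = [t for t, _ in RING]
    i = bisect_left(tokens, token)
    return RING[i % len(RING)][1]

# A token-aware policy sends the query straight to the owning replica,
# making it the coordinator and saving one network hop.
print(owner(150))  # node2
print(owner(350))  # wraps around -> node1
```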

Scenario 2: Read architecture within a node. The read flows through the row cache, bloom filter, key cache, partition summary, and partition index before reaching the SSTables.
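The lookup order above can be sketched as a simplified Python function (a toy model: each structure is a dict or set, the key cache is modeled as returning the value directly rather than an offset, and memtables are omitted):

```python
# Simplified per-node read path, following the order described above.
def read(key, row_cache, key_cache, bloom_filters, sstables):
    if key in row_cache:                      # 1. row cache: full row, fastest path
        return row_cache[key]
    for sid, data in sstables.items():
        if key not in bloom_filters[sid]:     # 2. bloom filter: skip SSTables that
            continue                          #    definitely lack this partition
        if key in key_cache.get(sid, {}):     # 3. key cache: jump to the row offset
            return data[key]
        # 4. miss: partition summary -> partition index -> seek into the SSTable
        if key in data:
            return data[key]
    return None

row_cache = {"k1": "row-from-cache"}
sstables = {"sstable1": {"k2": "v2"}}
blooms = {"sstable1": {"k2"}}
print(read("k1", row_cache, {}, blooms, sstables))  # row-from-cache
print(read("k2", {}, {}, blooms, sstables))         # v2
```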

3) Write Path

The write goes first to the commit log, then to the memtable, an in-memory table structure where data is sorted according to the table's partitioning; memtables are later flushed to SSTables on disk.
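The sequence can be sketched as a toy Python model (illustrative only: real Cassandra sorts rows within a partition by clustering columns, orders partitions by token, and flushes based on memory thresholds rather than row counts):

```python
# Toy write path: commit log first (durability), then the memtable;
# when the memtable is "full", flush it as an immutable sorted SSTable.
class Node:
    def __init__(self, memtable_limit=2):
        self.commit_log = []
        self.memtable = {}
        self.sstables = []
        self.memtable_limit = memtable_limit

    def write(self, partition_key, value):
        self.commit_log.append((partition_key, value))   # 1. commit log (on disk)
        self.memtable[partition_key] = value             # 2. memtable (in memory)
        if len(self.memtable) >= self.memtable_limit:    # 3. flush when full
            self.sstables.append(dict(sorted(self.memtable.items())))
            self.memtable = {}

n = Node()
n.write("b", 1); n.write("a", 2)   # second write triggers a flush
print(n.sstables)                  # [{'a': 2, 'b': 1}] -- sorted on flush
```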

4) Installation types
tarball, RPM (yum), Debian package (apt/aptitude)

5) JVM tuning

Heap configuration; garbage collection with the JVM's G1 collector (G1GC) or CMS (Concurrent Mark Sweep), and how each works.
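As an illustration, heap size and collector choice live in Cassandra's jvm.options file. The flags below are standard HotSpot options, but the values shown are placeholders, not tuning recommendations:

```
# jvm.options fragment (values are illustrative only)
-Xms8G                      # set min and max heap equal to avoid resize pauses
-Xmx8G
-XX:+UseG1GC                # use the G1 collector
-XX:MaxGCPauseMillis=500    # G1 pause-time target
```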

6) Snitches: types and functionality
SimpleSnitch, GossipingPropertyFileSnitch, Ec2Snitch, etc.
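For example, the snitch is set in cassandra.yaml, and GossipingPropertyFileSnitch reads each node's location from cassandra-rackdc.properties (the dc/rack names below are placeholders):

```yaml
# cassandra.yaml
endpoint_snitch: GossipingPropertyFileSnitch

# cassandra-rackdc.properties (one per node; names are examples)
# dc=dc1
# rack=rack1
```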

7) Memtable allocations
On-heap, off-heap, etc. (controlled by memtable_allocation_type in cassandra.yaml)

8) Adding, Decommissioning, Removing nodes.
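A rough sketch of the usual workflow: adding a node means configuring it (cluster name, seeds, snitch) and starting it so it bootstraps automatically; taking a node out is done with nodetool. These commands must run against a live cluster, so they are shown for reference only:

```shell
nodetool status                 # verify ring state before and after any topology change
nodetool decommission           # run ON the leaving node: it streams its data away first
nodetool removenode <host-id>   # run from a live node when a dead node cannot be recovered
```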

9) Comparison between open-source Apache Cassandra and DataStax Enterprise.

***

Sunday, July 28, 2019

How disk/memory space affects long column names in Cassandra

Is disk space affected by long column names in Cassandra/DataStax? Yes.

Usually a column is stored as a tuple of name, value, and timestamp. If the column's name is large, that space has to be allocated along with each value it persists.

We also need to consider the key cache, row cache, and memtable, as these replicate the same data and therefore use more space in memory.

A simple demonstration of disk space utilization:

CREATE TABLE test_pp.t1 (columnoftablet1inthekeyspacetestpp text PRIMARY KEY);

INSERT INTO test_pp.t1 (columnoftablet1inthekeyspacetestpp) VALUES ('a;sdkfjalksjfdl;ewjiekdnvasdif');

$ nodetool flush
$ ls -l *Data.db
-rw-r--r-- 1 cassandra cassandra 59 Jun 17 11:58 mc-1-big-Data.db

INSERT INTO test_pp.t1 (columnoftablet1inthekeyspacetestpp) VALUES ('k');

$ nodetool flush
$ ls -l *Data.db
-rw-r--r-- 1 cassandra cassandra 30 Jun 17 11:59 mc-2-big-Data.db

We can see the size difference between the data files: the first insert produces 59 bytes and the second insert 30 bytes.
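As a quick sanity check on the numbers above, the 29-byte gap between the two Data.db files matches the length difference of the two inserted values:

```python
# The two inserts differ only in the stored text; the file-size gap
# equals the length difference of the two values.
long_value = "a;sdkfjalksjfdl;ewjiekdnvasdif"
short_value = "k"
print(len(long_value) - len(short_value))  # 29
print(59 - 30)                             # 29 -- observed Data.db size difference
```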

***

Saturday, July 27, 2019

DSE database management services

DSE database management services:

1) Performance service:

This service collects and organizes performance diagnostic information into data-dictionary tables.

These metrics can be related to Cassandra, DSE Search, DSE Analytics.

Some of the metrics that can be gathered are

a) Slow queries
b) Latency metrics for non-system keyspaces
c) Node- and cluster-wide lifetime metrics by table and keyspace
d) Statistics such as the number of SSTables, latency, and partition sizes
e) Read/write activity per client and per node
f) Bottlenecks in DSE Search
g) Resources used in DSE Analytics

dse_perf is the keyspace that stores the performance metrics. Alter the keyspace with the desired replication strategy and replication factor:

ALTER KEYSPACE "dse_perf" WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'dc1' : 2, 'dc2' : 2};

Types of diagnostic metrics that can be collected:

Type 1: Cassandra Performance Service diagnostic table reference:

Example (enabled in dse.yaml):
node_slow_log table: queries on a node exceeding the threshold_ms parameter.
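A sketch of what the dse.yaml fragment might look like; the exact option names vary across DSE versions, so treat this as illustrative and check your version's dse.yaml reference:

```yaml
# Illustrative only -- verify option names for your DSE version
cql_slow_log_options:
    enabled: true
    threshold_ms: 2000   # queries slower than this are recorded in node_slow_log
```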

Type 2: DSE Search Performance Service diagnostic table reference:

Example (enabled in dse.yaml):
solr_slow_sub_query_log_options: reports distributed sub-queries that take longer than a specified time.
Temporarily enable with: dsetool perf solrslowlog enable

solr_indexing_error_log_options: Records errors that occur during document indexing.

Type 3: Monitoring Spark with Spark Performance Objects:

This collects data associated with the Spark cluster and Spark applications.

Example (enabled in dse.yaml):
spark_cluster_info_options and spark_application_info_options

***