Thursday, February 18, 2021

Unloading and Loading of data from Datastax Enterprise 6.8 using dsbulk

DSBulk (DataStax Bulk Loader) is used to load, unload and count data in Cassandra and DataStax Enterprise databases.

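Below is the command to run an unload using a configuration file. To run a load, replace unload with load and point -f at the load configuration file. The nohup prefix and trailing & keep the job running in the background after the session closes.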

nohup dsbulk unload -h "100.37.24.174,100.37.24.175" -maxErrors 1000 -u cassandra -p cassandra -f /datastax/toolbox/dsbulk-1.3.3/conf/unload_details.conf &

The parameters below performed well in our environment.

Configuration file to unload data:

[cassandra@localhost]$ cat unload_details.conf
dsbulk {
        connector.name = "json"
        connector.json.url = "/cassandra/backup/dsbulk_unload/"
        connector.json.fileNameFormat = "output-%0,6d.json"
        connector.json.maxRecords = 10000
        connector.json.generatorFeatures = { ESCAPE_NON_ASCII: true, QUOTE_FIELD_NAMES: true }
        schema.keyspace = "bypramod_keyspace"
        schema.table = "bypramod_table"
        executor.maxPerSecond = 300
        executor.maxInFlight = 50
        executor.continuousPaging.enabled = false
        driver.query.fetchSize = 5000
        driver.policy.maxRetries = 30
        driver.socket.readTimeout = 240000
}

Configuration file to load data:

[cassandra@localhost]$ cat load_details.conf
dsbulk {
        connector.name = "json"
        schema.keyspace = "bypramod_keyspace"
        schema.table = "bypramod_table2"
        connector.json.url = "/cassandra/backup/dsbulk_unload/"
        connector.json.fileNameFormat = "output-%0,6d.json"
        connector.json.maxRecords = 10000
        connector.json.generatorFeatures = { ESCAPE_NON_ASCII: true, QUOTE_FIELD_NAMES: true }
        executor.maxPerSecond = 2500
        executor.maxInFlight = 30
        executor.continuousPaging.enabled = false
        driver.query.fetchSize = 100
        driver.policy.maxRetries = 30
}


Note:

DSBulk is a Java tool and passes the fileNameFormat value to String.format(), so the format string "%0,6d" follows the java.util.Formatter rules, with the file counter as its argument. Here's a breakdown of each part:


%: This starts the format specifier.

0: This is a flag that specifies zero-padding. If the formatted number is shorter than the field width, it is padded with zeros on the left.

,: This is a flag that inserts locale-specific grouping (thousands) separators. It is independent of the 0 flag and does not separate it from the width.

6: This is the minimum field width. The formatted value takes up at least 6 characters, counting any grouping separators.

d: This is the conversion specifier for a decimal integer.

In summary, "%0,6d" zero-pads the file counter on the left to at least 6 characters and groups its digits in threes. The first file is named output-000001.json, while file 1234 is named output-01,234.json in an English locale. To keep separators out of file names, use "%06d", which is what the DSBulk default (output-%06d.json) uses.
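
The behaviour of the two flags can be checked with a small Java sketch. DSBulk is written in Java and its fileNameFormat setting follows the String.format() rules, so the same call illustrates how file names come out. The class name FileNameFormatDemo and the helper fileName are hypothetical names used only for this illustration, and Locale.US is pinned here as an assumption so the grouping separator is predictable; DSBulk itself would use the locale of the JVM it runs in.

```java
import java.util.Locale;

public class FileNameFormatDemo {
    // Formats a file counter with String.format(). Locale.US is pinned only
    // so that the grouping separator is a comma in this demo (assumption).
    static String fileName(String format, int counter) {
        return String.format(Locale.US, format, counter);
    }

    public static void main(String[] args) {
        String grouped = "output-%0,6d.json"; // format used in the configuration files
        String plain = "output-%06d.json";    // same padding without the grouping flag
        for (int n : new int[] {1, 999, 1234}) {
            // Prints both variants side by side for each counter value.
            System.out.println(fileName(grouped, n) + "  vs  " + fileName(plain, n));
        }
    }
}
```

For counters below 1000 both formats give the same name. From 1000 onwards the grouped format puts a comma in the file name (output-01,234.json), which may be inconvenient for shell globbing or downstream tools.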


DSBulk with SSL:
// Create SSL configuration file for DSBULK
$ vi /cassandra/toolbox/dsbulk/dsbulk_ssl.conf
dsbulk {
	connector.name = "csv"
}
datastax-java-driver {
	advanced {
		ssl-engine-factory {
			class = DefaultSslEngineFactory
			truststore-password = "truststore"
			truststore-path = "/cassandra/keystores/client.truststore"
			hostname-validation = false
		}
	}
}
// Use DSBULK commands with the SSL config file.
$ /cassandra/toolbox/dsbulk/bin/dsbulk count -f /cassandra/toolbox/dsbulk/dsbulk_ssl.conf -u cassandra -p cassandra -k bypramod_keyspace -t pramod --verbosity 0 --stats.modes global

***