hattajr

Create XFS Disk Partition in Linux

2024-07-02T00:00:00+00:00

Jul 2, 2024

Check the disk drives and note the device name, e.g. /dev/sdX
```
sudo fdisk -l
```

Create n partitions

sudo fdisk /dev/sdX
# n, p, default, default, +size
# repeat n times

Format all partitions as XFS

n=4  # number of partitions created above
for i in $(seq 1 "$n"); do sudo mkfs.xfs "/dev/sdX$i"; done

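Create the mount points and mount all the partitions
```
sudo mkdir -p /mnt/minio/disk{1..4}
for i in 1 2 3 4; do sudo mount "/dev/sdX$i" "/mnt/minio/disk$i"; done
```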

To keep the mounts across reboots, add the lines below to /etc/fstab

/dev/sdX1 /mnt/minio/disk1 xfs defaults 0 0
/dev/sdX2 /mnt/minio/disk2 xfs defaults 0 0
/dev/sdX3 /mnt/minio/disk3 xfs defaults 0 0
/dev/sdX4 /mnt/minio/disk4 xfs defaults 0 0
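
When the number of partitions changes, the fstab lines above can be generated instead of typed by hand. This is a minimal sketch: the helper name fstab_entries is hypothetical, and /dev/sdX is still the placeholder device name used throughout this post.

```python
def fstab_entries(device: str, mount_root: str, count: int) -> list[str]:
    """Build one /etc/fstab line per partition, numbered from 1 to count."""
    return [
        f"{device}{i} {mount_root}{i} xfs defaults 0 0"
        for i in range(1, count + 1)
    ]


if __name__ == "__main__":
    # /dev/sdX is a placeholder; replace it with the real device name.
    print("\n".join(fstab_entries("/dev/sdX", "/mnt/minio/disk", 4)))
```

The output can be reviewed and then appended to /etc/fstab manually.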

Validate the partition
```
df -h
```

Apache Superset Query Data From Deltalake

2023-09-20T00:00:00+00:00

Sep 20, 2023

Install the MinIO client (the minio package) using pip.

Set up a custom config at superset/docker/pythonpath_dev/superset_config.py

def build_config(env_file=".env"):
    from dotenv import dotenv_values

    config = dotenv_values(env_file)
    if any(
        [
            config.get("AWS_ACCESS_KEY_ID") is None,
            config.get("AWS_SECRET_ACCESS_KEY") is None,
            config.get("AWS_ENDPOINT_URL") is None,
        ]
    ):
        raise ValueError("AWS credentials not found in .env file")
    return config


def create_minio_client(env_file=".env"):
    """
    Creates a minio client using the credentials stored in the .env file.
    """
    from minio import Minio

    config = build_config(env_file=env_file)

    # remove http:// because Minio doesn't like it
    endpoint = config["AWS_ENDPOINT_URL"].split("//")[1]
    return Minio(
        endpoint,
        access_key=config["AWS_ACCESS_KEY_ID"],
        secret_key=config["AWS_SECRET_ACCESS_KEY"],
        secure=False,
    )


def create_minio_file_url(bucket_name, table_path):
    from datetime import timedelta
    from deltalake import DeltaTable

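    # Fill in your own values; the <...> strings are placeholders.
    storage_options = {
        "AWS_ACCESS_KEY_ID": "<access-key-id>",
        "AWS_SECRET_ACCESS_KEY": "<secret-access-key>",
        "AWS_ENDPOINT_URL": "<endpoint-url>",
        "AWS_REGION": "<region>",
        "AWS_S3_ALLOW_UNSAFE_RENAME": "true",
        "AWS_STORAGE_ALLOW_HTTP": "true",
    }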
    dt = DeltaTable(table_path, storage_options=storage_options)

    env_file = __file__.replace("superset_config.py", ".env")
    minioClient = create_minio_client(env_file=env_file)

    file_urls = []

    # Build a presigned download URL for every data file of the Delta table
    for url in dt.file_uris():
        url = url.replace(f"s3://{bucket_name}/", "")
        file_urls.append(
            minioClient.presigned_get_object(bucket_name, url, expires=timedelta(days=1))
        )
    return file_urls

JINJA_CONTEXT_ADDONS = {"delta_table": create_minio_file_url}
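
The config above strips the scheme with split("//") and removes the bucket prefix with replace(). A slightly more defensive variant could look like the sketch below. The helper names split_endpoint and object_key are hypothetical and not part of Superset, MinIO, or deltalake; only the standard library is used.

```python
from urllib.parse import urlparse


def split_endpoint(endpoint_url: str) -> tuple[str, bool]:
    """Return (host[:port], secure) for an endpoint URL such as http://minio:9000."""
    parsed = urlparse(endpoint_url)
    if not parsed.netloc:
        raise ValueError(f"endpoint URL must include a scheme: {endpoint_url!r}")
    # secure is True only for https endpoints
    return parsed.netloc, parsed.scheme == "https"


def object_key(file_uri: str, bucket_name: str) -> str:
    """Strip the s3://<bucket>/ prefix so only the object key remains."""
    prefix = f"s3://{bucket_name}/"
    return file_uri[len(prefix):] if file_uri.startswith(prefix) else file_uri


if __name__ == "__main__":
    print(split_endpoint("http://minio:9000"))
    print(object_key("s3://spatch/table/part-0.parquet", "spatch"))
```

The two return values of split_endpoint map onto the endpoint and secure arguments passed to Minio(...) in create_minio_client.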

In Superset SQL Lab:

SELECT test_id, test_actual_duration FROM read_parquet({{delta_table('spatch', <table-path>)}}) LIMIT 7;

Databricks Webinar - Deltalake

2023-04-18T00:00:00+00:00

Apr 18, 2023

Day 2

HIVE METASTORE (?)
Delta Lake managed and external tables
- managed → if you drop the table, both the Parquet files and the table metadata are dropped
- external → only the table metadata is dropped

Describe a Delta Lake table

 DESCRIBE DETAIL '/data/events/'
 
 DESCRIBE DETAIL eventsTable
 
 DESCRIBE [SCHEMA] EXTENDED ${da.schema_name}_default_location;

Create table

CREATE SCHEMA IF NOT EXISTS ${da.schema_name}

What is the difference from DeltaTable.createIfNotExists?
Defining the table schema when creating the table is recommended
SQLite → a database that uses the local file system
Use CREATE OR REPLACE TEMP VIEW {table_name} ({all_col_schema}) to read and infer the schema before creating the table, or use CREATE OR REPLACE TABLE sales_unparsed (find out more about this and CTAS)
You can enrich your Delta table with extra metadata: use pyspark.sql.functions as F to add input_file_name or current_timestamp columns
AS A BEST PRACTICE, DON'T USE PARTITIONING IN DELTALAKE IF THE TABLE IS SMALL
You can use a DEEP or SHALLOW clone of a source Delta table for model development or testing.
Avoid using Spark RDDs; use the higher-level APIs because they include query optimization
You can use sparkdf.schema to get the schema and use it to create another table
df.collect() pulls all data from every executor to the driver
You can use df.createOrReplaceTempView to work with the Spark SQL API; it registers a temporary view for the session
Spark DataFrames are immutable
count(*) counts rows that contain null values; use count(colname) to skip nulls
Convert from timestamp to date format
You can use . (dot notation) to access nested data
Infer the schema of any table or DataFrame as a native PySpark schema
Many Spark schemas use structs to reduce memory usage. If you need to explode, consider selecting only the needed columns first. Exploding a DataFrame is expensive!
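
The count(*) versus count(colname) note can be checked without a cluster. The sketch below uses SQLite from the Python standard library rather than Spark; the events table and its columns are made up for illustration. Both engines follow the standard SQL rule that count(column) skips NULLs.

```python
import sqlite3

# In-memory database with a hypothetical table; two rows have a NULL device.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, device TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(1, "a"), (2, None), (3, "b"), (4, None)],
)

# count(*) counts every row; count(device) counts only non-NULL values.
total_rows, non_null_devices = conn.execute(
    "SELECT count(*), count(device) FROM events"
).fetchone()
conn.close()

print(total_rows, non_null_devices)  # prints: 4 2
```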

Day 3 (20230419)

Incremental data ingestion with Auto Loader
To enable schema evolution for future needs, you need to track the schema evolution (ref: deltalake schema evolution format)
To avoid dropping data because of a mismatched schema or data type, you can create a column called _rescued_data (check how to use)
schema_hints
Checkpoints are really important in Structured Streaming
Streaming should be
- Highly available
- Replayable
- Durable
- Idempotent

You can view a streaming DataFrame in a streaming manner using readStream(...).createOrReplaceTempView(). Then you can inspect the data or apply transformations using the Spark SQL API, and write the result back using spark.table("tempViewname").writeStream(...) (see DL 6.3L)

 (spark.readStream
     .table("bronze")
     .createOrReplaceTempView("streaming_tmp_vw"))
 
 %sql
 SELECT * FROM streaming_tmp_vw
 
 for s in spark.streams.active:
     print("Stopping " + s.id)
     s.stop()
     s.awaitTermination()
 
 # operation
 %sql 
 SELECT device_id, count(device_id) AS total_recordings
 FROM streaming_tmp_vw
 GROUP BY device_id

Streaming trigger(availableNow=True)
Streaming trigger(once=True) is useful for running the job only when you need it. Because the checkpoint is stored, the next run resumes from the last checkpoint. However, trigger(availableNow=True) is the recommended option.
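
The checkpoint behaviour described above can be pictured with a toy model. This is not how Spark implements checkpoints; it is a plain-Python sketch in which run_once and the offset.json file are hypothetical, showing only the idea that each run processes what arrived since the stored offset.

```python
import json
import tempfile
from pathlib import Path


def run_once(source: list[str], checkpoint: Path) -> list[str]:
    """Return only the records after the stored offset, then advance the checkpoint."""
    offset = json.loads(checkpoint.read_text())["offset"] if checkpoint.exists() else 0
    batch = source[offset:]
    # Persist the new offset so the next run resumes where this one ended.
    checkpoint.write_text(json.dumps({"offset": len(source)}))
    return batch


with tempfile.TemporaryDirectory() as tmp:
    ckpt = Path(tmp) / "offset.json"
    source = ["r1", "r2", "r3"]
    first = run_once(source, ckpt)    # no checkpoint yet: everything is new
    source += ["r4", "r5"]
    second = run_once(source, ckpt)   # only the records added since the first run
    third = run_once(source, ckpt)    # nothing new: empty batch

print(first, second, third)
```

Rerunning with no new data yields an empty batch, which is the behaviour that makes an on-demand trigger safe to schedule repeatedly.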

MULTI-HOP Architecture

How to Test/Debug Spark Streaming

2022-04-30T00:00:00+00:00

Apr 30, 2022

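import time

from pyspark.sql import DataFrame


def batch_function(cdf_df: DataFrame, batch_id):
    # Called once per micro-batch; print the row count to inspect what arrived.
    print(cdf_df.count())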

query = (
    spark.readStream
    .format("delta")
    .options(readChangeFeed="true", maxBytesPerTrigger="1K")
    .table("_temp_max_bytes_calc")
    .select("id", "year", "month")
    .drop_duplicates()
    .writeStream
    .foreachBatch(batch_function)
    .trigger(processingTime="5 seconds")
    .queryName("my query")
    .start()
)

time.sleep(5)
while query.isActive:
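    # Stop once the stream is idle: no data pending, no trigger running,
    # and the sources have finished initializing.
    stop_conditions = [
        not query.status["isDataAvailable"],
        not query.status["isTriggerActive"],
        query.status["message"] != "Initializing sources",
    ]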

    if all(stop_conditions):
        query.stop()

    time.sleep(1)

query.awaitTermination(10)
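
The stop condition in the loop above can be tested on its own, without starting a stream, by extracting it into a pure function. In this sketch should_stop is a hypothetical helper, and the scripted status dictionaries are illustrative values shaped like query.status, not captured from a real run.

```python
def should_stop(status: dict) -> bool:
    """True once the stream is idle: no data pending, no trigger running, sources initialized."""
    return (
        not status["isDataAvailable"]
        and not status["isTriggerActive"]
        and status["message"] != "Initializing sources"
    )


# Scripted statuses standing in for successive reads of query.status.
statuses = [
    {"message": "Initializing sources", "isDataAvailable": False, "isTriggerActive": False},
    {"message": "Processing new data", "isDataAvailable": True, "isTriggerActive": True},
    {"message": "Waiting for next trigger", "isDataAvailable": False, "isTriggerActive": False},
]

decisions = [should_stop(s) for s in statuses]
print(decisions)  # prints: [False, False, True]
```

Reading query.status once per iteration and passing it to such a function also avoids the status changing between the three checks.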

Software Development Quality

2022-04-09T00:00:00+00:00

Apr 9, 2022

Apache Spark Configuration Experiences

2021-08-05T00:00:00+00:00

Aug 5, 2021

Some important spark configuration

--conf spark.driver.memory=90g \
--conf spark.sql.files.maxPartitionBytes=4194304 \
--conf spark.eventLog.enabled=true \
--conf spark.eventLog.dir=/opt/airflow/dags/spark-events

Partition configuration
- Input partitions: controlled by spark.sql.files.maxPartitionBytes. The default is 128 MB, but if you have an explode operation, consider decreasing this value.
- Shuffle partitions: controlled by spark.sql.shuffle.partitions. The default is 200 partitions, but if you have an explode operation, consider tuning this value.
- Output partitions: controlled by .option("maxRecordsPerFile", "x")

Estimate the number of partitions

import numpy as np


print ("\n\n*********************************************************************************")

"""
IF THE SHUFFLE PARTITION COUNT DOES NOT CHANGE IN THE SPARK UI, PLEASE CHECK THE FOLLOWING:
- REMOVE THE TABLE IN MINIO WITH THE SAME NAME AS THE RUNNING SPARK APP
"""
CONS_SHUFFLE_RATIO_ROWS = 1/21
CONST_MB_TO_BYTES = 1024 * 1024

total_num_rows = 5932823  # <- you only need this as input

seed_no = 123
num_cores = int(32)
total_available_memory_gb= 125
reserved_memory_gb = 35

target_input_max_bytes_per_file_mb = 4
target_input_max_bytes_per_file_bytes = target_input_max_bytes_per_file_mb * CONST_MB_TO_BYTES

max_records_per_file = 1_000

num_shuffle_partitions = int(np.ceil(total_num_rows * CONS_SHUFFLE_RATIO_ROWS))

print(f"{'SEED NO':<30} : {seed_no}")
print(f"{'NUM CORES':<30} : {num_cores}")
print(f"{'TOTAL MEMORY DRIVER':<30} : {total_available_memory_gb-reserved_memory_gb} gb")
print (f"{'MEMORY PER CORE':<30} : {(total_available_memory_gb - reserved_memory_gb)/num_cores} gb")

print(f"{'MAX INPUT SIZE':<30} : {target_input_max_bytes_per_file_mb} mb / {target_input_max_bytes_per_file_bytes} bytes")
print (f"{'SHUFFLE PARTITION':<30} : {num_shuffle_partitions} partitions")
print (f"{'MAX RECORDS PER FILE':<30} : {max_records_per_file} records")
print ("*********************************************************************************\n\n")
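
The same estimate can be turned into ready-to-paste --conf flags. This is a sketch under the assumptions of the script above (one shuffle partition per 21 rows, 4 MB input partitions); the function name estimate_spark_conf is hypothetical, and only the standard library is used instead of NumPy.

```python
import math

MB = 1024 * 1024


def estimate_spark_conf(
    total_num_rows: int,
    rows_per_shuffle_partition: int = 21,
    max_input_mb: int = 4,
) -> dict[str, str]:
    """Derive the input and shuffle partition settings from the row count."""
    return {
        "spark.sql.files.maxPartitionBytes": str(max_input_mb * MB),
        "spark.sql.shuffle.partitions": str(
            math.ceil(total_num_rows / rows_per_shuffle_partition)
        ),
    }


conf = estimate_spark_conf(5932823)
flags = [f"--conf {key}={value}" for key, value in conf.items()]
print(" \\\n".join(flags))
```

For the row count used above this yields 4194304 bytes for the input partitions, matching the --conf value listed earlier, and 282516 shuffle partitions.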

Setup Hadoop Cluster

2021-07-08T00:00:00+00:00

Jul 8, 2021

Create a new user for the Hadoop master: useradd -m its
Give it root access using visudo: its ALL=(ALL:ALL) ALL
Change the hostname using sudo hostname

https://www.redhat.com/sysadmin/change-hostname-linux
Download OpenJDK 8 or 11 and extract the tar
- JDK download links:
- https://www.oracle.com/java/technologies/downloads/
- https://jdk.java.net/archive/
Move the extracted folder to /usr/local/ or /opt/ so everybody can access Java

Add the environment variables to .bashrc and extend PATH

export JAVA_HOME=/usr/local/jdk-18.0.1.1
export PATH=$PATH:$JAVA_HOME/bin

Map the nodes
Configure SSH keys for all slave nodes
Download and install Hadoop (untar using tar -xzf <.tar.gz file>)
- hadoop mirror link: https://dlcdn.apache.org/hadoop/common/
Configure Hadoop
- the Hadoop .xml files can be found in hadoop/etc/hadoop/
Fix permission problems and copy Hadoop to the other nodes
- in order to rsync, you need to mkdir hadoop on the slave node and run chown -R its hadoop
sudo chmod -R 777 opt

rsync -avzhP /opt/hadoop/hadoop-3.3.3 hadoop-slave-01@host:/opt/hadoop

Important Notes

Don’t forget to set up a uniform /etc/hosts on the master and all nodes
To format or restart Hadoop, make sure you use bin/hdfs namenode -format
On every restart, make sure to remove the dfs directory
hadoop/etc/hadoop/workers on all nodes should list hostnames; don’t use localhost
Check the directory ownership and permissions (chown or chmod)
without avro 16mins → 6.5mb
Check the Hadoop directory with bin/hdfs dfs -ls /
If hadoop:9000 does not appear in netstat, check the dfs dir. hadoop:9000 only works if the dfs dir is available (this usually happens when you remove the dfs dir after running bin/hdfs namenode -format).
If one server/datanode is down, run hdfs --daemon start datanode on that node.
Ensure pyarrow is installed

If you get the error below, try another version of the JDK

ERROR Cannot set priority of resourcemanager process at <>

When a datanode is not detected or is not shown in the web UI, remove the dfs directory on that datanode, then stop, format, and start again

If the error below happens → on the Hadoop master node, create the directory with bin/hdfs dfs -mkdir /raw and open its permissions with bin/hdfs dfs -chmod -R 777 /raw

Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): Permission denied: user=root, access=WRITE, inode="/raw":its:supergroup:drwxrwxr-x

If the error below happens → on the client, set the environment variable export HADOOP_USER_NAME=

22/07/26 03:41:44 ERROR MicroBatchExecution: Query [id = 81b7eb6c-a753-4f69-904e-6ed1af5e0721, runId = a9b235df-e53a-4ef6-a02d-b22b9f8fbd2d] terminated with error
org.apache.hadoop.security.AccessControlException: Permission denied: user=root, access=WRITE, inode="/":its:supergroup:drwxr-xr-x

Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): Permission denied: user=root, access=WRITE, inode="/":its:supergroup:drwxr-xr-x
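
The permission-denied messages above already contain what is needed to diagnose them: the requesting user, the requested access, and the owner, group, and mode of the inode. The sketch below pulls those fields out with a regular expression. The helper name explain is hypothetical, and the pattern is written against the message format shown in this post only.

```python
import re

PATTERN = re.compile(
    r"Permission denied: user=(?P<user>\w+), access=(?P<access>\w+), "
    r'inode="(?P<path>[^"]+)":(?P<owner>\w+):(?P<group>\w+):(?P<mode>[d-][rwx-]{9})'
)


def explain(message: str) -> dict:
    """Extract user, access, path, owner, group and mode from an HDFS permission error."""
    match = PATTERN.search(message)
    if match is None:
        raise ValueError("not an HDFS permission-denied message")
    info = match.groupdict()
    # In a mode string such as drwxr-xr-x, index 8 is the write bit for "other" users.
    info["other_can_write"] = info["mode"][8] == "w"
    info["user_is_owner"] = info["user"] == info["owner"]
    return info


msg = (
    "org.apache.hadoop.security.AccessControlException: Permission denied: "
    'user=root, access=WRITE, inode="/":its:supergroup:drwxr-xr-x'
)
info = explain(msg)
print(info)
```

Here the client runs as root, the inode is owned by its, and other users have no write bit, which is why the write is rejected.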

References

https://www.tutorialspoint.com/hadoop/hadoop_multi_node_cluster.htm

https://dlcdn.apache.org/hadoop/common/

hdfs-site.xml

<configuration>
<property>
<name>dfs.datanode.data.dir</name>
<value>/opt/hadoop/hadoop/dfs/name/data</value>
<final>true</final>
</property>

<property>
<name>dfs.name.dir</name>
<value>/opt/hadoop/hadoop/dfs/name</value>
<final>true</final>
</property>

<property>
<name>dfs.permissions</name>
<value>false</value>
</property>

<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
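
Because the same file has to be identical on every node, it can help to generate it from one list of properties. The sketch below rebuilds the four properties listed above with the standard library; build_hdfs_site is a hypothetical helper, and ET.indent requires Python 3.9 or newer.

```python
import xml.etree.ElementTree as ET

# (name, value, final) tuples taken from the configuration above.
PROPERTIES = [
    ("dfs.datanode.data.dir", "/opt/hadoop/hadoop/dfs/name/data", True),
    ("dfs.name.dir", "/opt/hadoop/hadoop/dfs/name", True),
    ("dfs.permissions", "false", False),
    ("dfs.replication", "1", False),
]


def build_hdfs_site(properties) -> str:
    """Serialize the properties as a <configuration> document."""
    root = ET.Element("configuration")
    for name, value, final in properties:
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
        if final:
            ET.SubElement(prop, "final").text = "true"
    ET.indent(root)  # pretty-print in place (Python 3.9+)
    return ET.tostring(root, encoding="unicode")


xml_text = build_hdfs_site(PROPERTIES)
print(xml_text)
```

Parsing the result back with ET.fromstring is a quick way to confirm that every tag is closed before copying the file to the other nodes.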

Integrate TMUX into VSCODE Terminal

2021-07-02T00:00:00+00:00

Jul 2, 2021

Put the settings below into .vscode/settings.json

{
  "terminal.integrated.profiles.linux": {
    //...existing profiles...
    "tmux": {
      "path": "tmux",
      "args": ["attach-session", "-d", "-t", ":${workspaceFolderBasename}"]
    }
  },
  "terminal.integrated.defaultProfile.linux": "tmux",
  "terminal.integrated.fontSize": 11
}

TDD Rules

2021-05-19T00:00:00+00:00

May 19, 2021

Rules of TDD

You are not allowed to write any production code before writing a failing test.
You are not allowed to write more test code than is required to fail.
You are not allowed to write more code than is required to pass the failing test.
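
One cycle of the three rules might look like the sketch below. The add function and AddTest class are made-up examples: the test is written first and fails while add does not exist, and add is then the least code needed to make it pass.

```python
import unittest


def add(a, b):
    # Rule 3: only as much production code as the failing test requires.
    return a + b


class AddTest(unittest.TestCase):
    # Rules 1 and 2: this test existed (and failed) before add() was written,
    # and it contains no more than is needed to fail.
    def test_adds_two_numbers(self):
        self.assertEqual(add(2, 3), 5)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(AddTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```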

MapReduce explained in 41 Words

2021-03-15T00:00:00+00:00

Mar 15, 2021

Goal: Count the number of books in the library.

Map: You count up shelf #1, I count up shelf #2. (The more people we get, the faster this part goes.)

Reduce: We all get together and add up our individual counts.
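
The library analogy translates almost directly into code. In this sketch the shelves and their contents are made up: each worker counts one shelf (map), and the individual counts are then added together (reduce).

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
from operator import add

# Each inner list is one shelf of the library.
shelves = [["a", "b", "c"], ["d", "e"], ["f", "g", "h", "i"]]


def count_shelf(shelf: list[str]) -> int:
    """Map step: one person counts one shelf."""
    return len(shelf)


# Map: shelves are counted independently, so more workers means faster counting.
with ThreadPoolExecutor() as pool:
    per_shelf = list(pool.map(count_shelf, shelves))

# Reduce: everyone gets together and adds up the individual counts.
total = reduce(add, per_shelf, 0)

print(per_shelf, total)  # prints: [3, 2, 4] 9
```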

References: https://www.chrisstucchio.com/blog/2011/mapreduce_explained.html
