Raw Practice Notes Dump — AWS Data Engineer Associate Week 1
# If anyone finds this useful
I'm currently studying for the AWS Data Engineer Associate
These are my raw notes — I haven't cleaned them yet (will do so soon, I think)
Whenever I get a practice question wrong, or see something I don't understand, I create a new entry in this list
Hopefully this helps you in some way if you’re studying for the AWS Data Engineer Associate Certification
I’ll probably tidy this up more in the next few days
SQS Queue Stuff
Events that can remove messages from SQS queue
- DeleteMessage API call — typical method to remove messages from a queue
- maxReceiveCount has been reached for the message — how many times a message can be received before it is removed and added to the dead-letter queue
- queue is purged → remove all messages in queue without deleting queue (usually to reset an app)
Standard SQS queue VS first in first out (FIFO) queue
- standard → unlimited throughput — nearly unlimited number of API calls per second per API action
- FIFO → supports up to 300 messages per second per API method; with batching, 3,000 messages per second (300 batches of 10 messages). High throughput mode raises these limits further
- standard → at-least-once delivery (might be more than once)
- FIFO → exactly once processing (one and only time, in order)
- standard → best effort ordering → might be in slightly different order
- FIFO → first in first out delivery → order strictly preserved
SQS visibility timeout meaning
- when consumer A is processing message in queue, SQS temporarily hides this message from other consumers.
- by setting visibility timeout on message
- visibility timeout → time period where other consumers (not A) are prevented from receiving and processing this particular message
- default is 30 seconds, max is 12 hours
- eg. if app takes 10 seconds to process a message, but visibility timeout is 2 seconds, this message might be consumed by another consumer, leading to duplicates
Dead letter queues
- contains messages that can't be processed successfully by consumers of other queues (used for debugging)
- DLQ for FIFO queue must also be FIFO
Some Amazon SQS Actions/Attributes
- DeleteMessage → delete specified message from specified queue (use `receiptHandle` instead of `messageId`)
- PurgeQueue → deletes available messages in queue (by `QueueUrl`)
- ReceiveMessage → retrieve up to 10 messages from specified queue
- SendMessage → deliver message to specified queue
NOTE — the ReceiveMessage API call does not delete a message from the queue. consumer A gets data → visibility timeout starts → other consumers cannot see this message for a period of time → consumer A finishes processing the message → consumer A tells the queue to delete the message using DeleteMessage (sketch below)
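Rough boto3 sketch of that receive → process → delete loop (the queue URL is a placeholder and the processing step is just a stub):

```python
import boto3

sqs = boto3.client("sqs")
queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"  # placeholder

# Receive up to 10 messages; the visibility timeout hides them from other consumers
resp = sqs.receive_message(
    QueueUrl=queue_url,
    MaxNumberOfMessages=10,
    VisibilityTimeout=30,   # seconds this consumer has to finish processing
    WaitTimeSeconds=10,     # long polling
)

for msg in resp.get("Messages", []):
    body = msg["Body"]
    # ... process the message here (stub) ...
    # Delete using the ReceiptHandle (not the MessageId)
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
```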
Metadata store w fine-grained permissions
Centralized metadata storage solution (reliable, scalable) + ensure that fine-grained permissions can be controlled at database, table, col, row, cell levels
- Use AWS Lake Formation to create data lake and data catalog + control access using Lake formation data filters
Stuff that doesn’t work
- AWS Glue → Glue Data Catalog can control permissions at the table level, but cannot control at column, row, cell levels
- Aurora DB → would also need to deploy a Hive app (high operational overhead)
- Amazon EMR → cannot control access to Hive metastore
AWS Glue Data Catalog with Amazon Athena
- can only limit access to databases and tables. Fine-grained access controls apply at the table level; cannot limit access to individual partitions
Apache Hive on EMR
- open source, distributed, provides data warehouse-like query capabilities
- enables users to read/write/manage lots of data using SQL-like syntax
- lots of data == petabytes (1,000 terabytes == 1 petabyte)
AWS Lake Formation → to create data lakes
- data lake == centralized repo to store structured/unstructured data at any scale (and any format)
- can manage fine-grained access control for data lake data on S3 + metadata in AWS glue data catalog
- fine-grained access controls go down to row, column, even cell level (sketch below)
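Rough idea of how a data filter plus a grant looks with boto3 (database, table, filter expression, account ID and role are all made up; treat this as a sketch, not a recipe):

```python
import boto3

lf = boto3.client("lakeformation")
account_id = "123456789012"  # placeholder

# Data cells filter: only rows where country = 'SG', and only two columns
lf.create_data_cells_filter(
    TableData={
        "TableCatalogId": account_id,
        "DatabaseName": "sales_db",     # placeholder names
        "TableName": "orders",
        "Name": "sg_orders_filter",
        "RowFilter": {"FilterExpression": "country = 'SG'"},
        "ColumnNames": ["order_id", "amount"],
    }
)

# Grant SELECT on that filter to an analyst role
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": f"arn:aws:iam::{account_id}:role/AnalystRole"},
    Resource={
        "DataCellsFilter": {
            "TableCatalogId": account_id,
            "DatabaseName": "sales_db",
            "TableName": "orders",
            "Name": "sg_orders_filter",
        }
    },
    Permissions=["SELECT"],
)
```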
Redshift Stuff
Redshift → petabyte-scale data warehouse service
- Note: with Redshift Serverless we only pay when queries run; no compute charges while data sits idle
Data warehouse → giant database optimized for analytics
- store large amounts of data (current & historical) from multiple sources
- ETL processes move data from other sources into data warehouse
- fixed relational schema, but sometimes semistructured is ok
Redshift COPY Command
- loads data into table from data files (eg S3, EMR) or DynamoDB
- ie other source data → redshift table
Redshift JOIN command
- like SQL → combine data from 2 or more tables
Redshift Materialized View
- contains a precomputed result set (based on a SQL query)
- SELECT this materialized view → precomputed result returned without accessing base tables → faster
- useful for speeding up queries that are predictable and repeated
Redshift UNLOAD command
- unloads the result of a query to S3 as TXT/JSON/CSV/PARQUET
- default output is pipe (|) delimited text (example below)
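Roughly what COPY and UNLOAD look like (bucket, table and IAM role ARN are placeholders; you'd run these in the query editor or via the Data API sketched further down):

```python
# COPY: load Parquet files from S3 into a Redshift table
copy_sql = """
COPY sales
FROM 's3://my-bucket/sales/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
FORMAT AS PARQUET;
"""

# UNLOAD: write query results back to S3 (pipe-delimited text unless you specify a format)
unload_sql = """
UNLOAD ('SELECT * FROM sales WHERE sale_date = ''2024-01-01''')
TO 's3://my-bucket/exports/sales_'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftUnloadRole'
FORMAT AS PARQUET;
"""
```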
Cron stuff
Cloudshell
- browser-based, pre-authenticated shell we can launch directly from AWS management console.
- can run AWS CLI commands without having to install locally
EventBridge Scheduler
- serverless scheduler that allows us to create/run/manage scheduled tasks
Cloud9
- integrated development environment (IDE)
- usable through web browser
Kinesis Data Streams with AWS Lambda
To increase performance when reading kinesis data streams w lambda:
- increase number of shards for kinesis data stream
- test different parallelization factor settings to find most performant
- register lambda function as consumer w enhanced fan-out
Lambda default behaviour
- lambda create 1 concurrent instance of lambda function for each shard
- 3 shards → 3 concurrent functions
Kinesis Data Streams (KDS)
- a kinesis data stream is a set of shards
- producer → sends data to kinesis data streams eg. log data
- consumer → get data from kinesis data streams
- shard → uniquely identified sequence of data records in a stream. Stream composed of 1+ shards. Each shard provides fixed unit of capacity eg. 5 transactions per second.
- data capacity of stream depends on number of shards WE specify
- partition key → used to group data by shard within a stream (sketch below)
- sequence number → each data record has sequence number which is unique per partition-key within its shard
- server-side encryption → KDS auto encrypt data from producer as it enters a stream (KMS master keys for encryption)
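Minimal producer sketch showing where the partition key fits in (stream name and payload are made up):

```python
import json
import boto3

kinesis = boto3.client("kinesis")

# Records with the same PartitionKey always land in the same shard,
# which is what preserves per-key ordering
kinesis.put_record(
    StreamName="app-logs",   # placeholder stream name
    Data=json.dumps({"user": "u42", "event": "login"}).encode("utf-8"),
    PartitionKey="u42",
)
```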
KDS — Shared throughput VS Dedicated throughput (enhanced fanout)
- throughput → volume of data passing through network at given period
- shared throughput → fine when we don't need propagation delays under ~200 ms (consumers share each shard's 2 MB/s read throughput)
- dedicated throughput → consumers that use this don't need to share with other consumers
- enhanced fanout → consumers to receive records from stream with throughput of up to 2 MB of data per second per shard
- can register up to 20 consumers per stream to use enhanced fanout
Lambda and KDS
- lambda polls each shard in kinesis stream using HTTP protocol
- shared throughput w other consumers of shard
- lambda reads records from data stream, invokes function synchronously (read in batches)
- to increase concurrency → process multiple batches in parallel (lambda can process up to 10 batches in each shard simultaneously)
- ParallelizationFactor → process ONE shard with MORE THAN ONE lambda function (shard from KDS or DynamoDB streams); sketch below
- eg. set ParallelizationFactor = 2 → at most 200 concurrent lambda functions to process 100 kinesis data shards → helps scale up processing throughput when IteratorAge is high
- IteratorAge → age of the last record in all GetRecords calls made against the stream, regardless of which consumer called it
- high iteratorAge → data is not being processed in timely manner
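Sketch of setting ParallelizationFactor on the Lambda event source mapping (stream ARN and function name are placeholders):

```python
import boto3

lam = boto3.client("lambda")

lam.create_event_source_mapping(
    EventSourceArn="arn:aws:kinesis:us-east-1:123456789012:stream/app-logs",  # placeholder
    FunctionName="process-logs",   # placeholder
    StartingPosition="LATEST",
    BatchSize=100,
    ParallelizationFactor=2,       # up to 2 concurrent batches per shard
)
```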
KDS Capacity modes
- determines how data stream is managed + $$$
- on-demand → no capacity planning, auto scale number of shards
- provisioned → WE specify number of shards
- unpredictable nature of data → on-demand more performant
Lambda reserved concurrency
- concurrency → number of in-flight requests that the function is currently handling
- reserved concurrency → max number of concurrent instances allocated to your function. no added $$$
- provisioned concurrency → number of pre-initialized execution environments allocated to the function, ready to respond to incoming requests. added $$$ (sketch of both below)
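Sketch of setting both (function name and alias are placeholders):

```python
import boto3

lam = boto3.client("lambda")

# Reserved concurrency: hard cap on concurrent executions (no extra cost)
lam.put_function_concurrency(
    FunctionName="process-logs",
    ReservedConcurrentExecutions=50,
)

# Provisioned concurrency: pre-warmed execution environments (billed extra);
# must target a published version or alias, not $LATEST
lam.put_provisioned_concurrency_config(
    FunctionName="process-logs",
    Qualifier="live",
    ProvisionedConcurrentExecutions=10,
)
```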
Some Storage stuff
Elastic Block Store (EBS)
- scalable block storage that can be used with EC2 instances
- EBS volumes → we attach this to EC2 instance (kinda like hard drive)
- EBS snapshots → point-in-time backups of EBS volumes
DynamoDB (serverless)
- fully managed, noSQL db service + distributed database
- DynamoDB accelerator (DAX) → in memory cache
Redshift Stuff
Redshift → petabyte-scale data warehouse service
- (serverless) resources automatically provisioned, capacity auto scaled
- no compute $$$ when data is idle
- redshift query editor v2 OR other BI tool eg. quicksight
- cluster → one or more compute nodes
- leader node → manages comms with client programs + distributes SQL statements to compute nodes to get results
Redshift Performance features
- Massively parallel processing (MPP) → multiple compute nodes handle query processing at once. Redshift distributes the rows of a table across compute nodes → faster
- Columnar data storage → reduce disk IO requirements + better perf
- Data compression → reduce storage requirements. when query, compressed data read into memory, uncompressed during query run
- Query optimizer → MPP-aware optimizer that also takes advantage of columnar storage
- Result caching → caches results of certain query types in memory
Redshift Materialized views
- complex queries → eg multiple table joins and aggregations
- materialized view → precomputed result set (one or more tables)
- can SELECT a materialized view → results w/o touching base tables
- useful for speeding queries that are repeated and predictable
Federated queries → can query across operational databases, warehouses, lakes + integrate queries with other external DBs
Redshift Data Sharing
- securely share access to live data across redshift clusters
- data sharing also covers write-access use cases
Redshift Manual Snapshots
- ability to create point-in-time recovery backups (not cost effective)
Redshift Serverless vs Provisioned
- serverless → auto provisioned, auto scaled
- provisioned → self manage compute and leader nodes
- serverless → pay for workloads run
- provisioned → billing occurs per second while the cluster isn't paused
Redshift User Defined Functions (UDFs) → custom scalar functions written in SQL or Python (Lambda UDFs are also supported)
Some Encryption Stuff
AWS KMS → create/control encryption keys to protect data
Encrypting RDS
- data encrypted at rest → storage, automated backups, read replicas, snapshots
- secure data from unauthorized access to underlying storage
- RDS auto integrate with KMS
- if want full control over KMS key → customer managed key
- AWS managed key → less hassle but less control
Server side encryption w KMS (SSE-KMS)
- all S3 buckets have encryption configured by default
- all new objects uploaded are auto encrypted at rest
- server side encryption w S3 managed keys (SSE-S3) is default encryption
AWS Glue DataBrew
- visual data preparation tool that lets us clean and normalize data without needing to write code. provides data masking mechanisms
- 250+ ready-made transformations to automate data prep tasks eg filtering anomalies, converting data to standard formats, correcting invalid values → speeds up data prep
- identify and mask PII data in DataBrew
Sagemaker Data Wrangler
- end-to-end solution to import, prep, transform, featurize, and analyze data
- import, data flow, transform, generate data insights, analyze, export
- if use with PII, this exposes PII to ML model
AWS Glue Stuff
- serverless data integration service → discover, prep, move, integrate data from multiple sources + ETL
- Glue Studio → GUI to manage integration jobs
Discover and organize data
- unify and search across multiple data stores → store, index, search across multiple data sources by cataloging all data
- auto discover data → AWS glue crawlers to infer schema info, integrate into AWS glue data catalog
- manage schemas and permissions → control access to databases and tables
- connect to many data stores → glue connections to build data lake
Transform, prep, clean data for analysis
- drag n drop ETL interface
- complex ETL pipelines + simple job scheduling → invoke aws glue jobs based on schedule, on demand or based on event
- clean n transform streaming data in transit
- deduplicate and cleanse data w built-in ML
- built-in job notebooks → AWS Glue Job Notebooks
- AWS Glue sensitive data detections → find PII stuff in process
Build and monitor data pipelines
- auto scale based on workload
- automate jobs w event-based triggers
- run AWS Glue jobs on engines such as Spark or Ray; monitor using AWS Glue job run insights and AWS CloudTrail
- define workflows for ETL and integration activities
AWS Glue Data Catalog
- contains references to data that is used as a source/target for ETL jobs
- to create a data warehouse or data lake, must catalog this data
- data catalog → index to the location, schema, runtime metrics of my data
- info stored as metadata tables; each table specifies a single data store
AWS Glue databases
- used to organize metadata tables in AWS Glue
- when we define a table in the AWS Glue Data Catalog, we add it to a database
- a database contains tables that define data from many different data stores
- tables can be partitioned
AWS Glue connection → Data Catalog object that stores login credentials, URI strings, and VPC info for a particular data store
- documentdb, opensearch, glue for spark, redshift, kafka etc etc
Glue interactive sessions
- quickly build, test, and run data prep and analytics apps
- interactive, notebook-based environment for ETL development
Data Processing Units (DPU)
- monitoring section of AWS glue console uses results of previous job runs to determine appropriate DPU capacity
- use job metrics in AWS glue to estimate number of DPUs that can be used to scale out an AWS glue job
Monitoring AWS Glue using Amazon Cloudwatch Metrics
- profile n monitor glue operations using glue job profiler
- collects n processes raw data into readable, near real-time metrics stored in cloudwatch → access historical info for analytics
Amazon Macie
Macie findings
- repo of findings for macie account
- finding → detailed report of potential issue
- each finding → severity rating, info about affected resource + when and how macie found the issue
More Kinesis Stuff
ProvisionedThroughputExceededException → caused by exceeding the data stream's provisioned capacity.
- capacity quota is defined by the number of shards
- enhanced fan-out → scales up number of stream consumers by offering each stream consumer its own read throughput
- increase number of shards → capacity increases
Resharding a stream → adjust number of shards
- 2 types of resharding operations: 1) shard split 2) shard merge
- shard split → divide single shard into 2 shards
- shard merge → 2 shards become 1 shard
- split → increases number of shards in stream → increase cost
- merge → decrease number of shards in stream → decrease $$
Troubleshooting Kinesis Data Streams Consumers
- KDS records skipped → most common cause is unhandled exceptions thrown → to avoid infinite retries on a recurring failure, the KCL doesn't resend the batch of records
- Record belonging to same shard processed by different record processors at same time → set appropriate failover time
- consumer app reading at slow rate → increase number of shards
Appflow
fully managed integration service that allows us to securely exchange data between SaaS apps and AWS services
Appflow Flows
- a flow transfers data between a source and a destination
- data mapping determines how data from source is placed in dest
Amazon Managed Workflows For Apache Airflow (MWAA)
Apache Airflow → open source tool to programmatically author, schedule, monitor sequences of processes and tasks (workflows)
MWAA
- managed orchestration service for apache airflow, can use to setup and operate data pipelines in cloud at scale
- use apache airflow+Python to create workflows without caring abt infra
Some Container Storage Stuff
app that transforms data in containers (managed by AWS EKS). each containerized app will transform independent datasets and store data in a data lake + data doesn't need to be shared with other containers. Which solution meets these reqs with the lowest latency?
- CORRECT → containers should use the ephemeral volume provided by the node's RAM
- WRONG → containers should establish a connection to DAX within app code
- WRONG → containers should use a PersistentVolume object provided by NFS storage
- WRONG → containers should establish a connection to Amazon MemoryDB for Redis within app code
Ephemeral Volume → storage local to each instance (lowest latency)
Some ML Stuff
SageMaker Feature Store
- feature store → storage and data management layer for ML. single source of truth to store/retrieve/remove/track/share/discover/control access to features
- online store → low latency, high availability store for a feature group that enables real-time lookup of records
- offline store → stores historical data in an S3 bucket. used when low latency reads are not needed
- feature group → logical group of features used to describe records
- → we can use SageMaker Feature Store to create, store, share features for an ML model, but no native functionality to ensure data is accurate, complete, trustworthy
SageMaker ML Lineage Tracking
- creates and stores info about the steps of an ML workflow
- ability to establish model governance and audit standards
- ensures data being used for ML decisions is accurate, complete, trustworthy
SageMaker Data Wrangler → run EDA and prep data (but no native functionality to ensure data is accurate complete trustworthy)
SageMaker Processing → managed service we can use to run operations eg. data-processing workloads, data validation, model evaluation
More Glue Stuff
- AWS Glue DataBrew → data sanitization and normalization, can be used to remove outliers (some parameters to be specified)
- AWS Glue Schema Registry → integrates with a number of AWS services to centrally manage and enforce schemas; supports streaming data sources. does not have an exclude_outliers parameter
- AWS Glue FindMatches → used to deduplicate a dataset
- AWS Glue Data Catalog → used to store metadata on data sources
- AWS Glue Spigot transformation → writes sample records to a specified destination to help verify transformations performed by a Glue job
- AWS Glue Job w Pushdown Predicate → enables us to filter data in AWS glue data catalog by a partition
- AWS Glue SelectFields transform → creates new DynamicFrame from existing DynamicFrame, keeps only fields that we specify
Athena Create Table As Select (CTAS) query
- creates new table in athena from results of SELECT statement
- stores data files created by CTAS statement in specified S3 loc
- create tables from query results in one step, without re-running the query (example below)
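Rough CTAS example submitted through the Athena API (database, table names, bucket paths and workgroup are placeholders):

```python
import boto3

athena = boto3.client("athena")

ctas = """
CREATE TABLE logs_parquet
WITH (
    format = 'PARQUET',
    external_location = 's3://my-bucket/curated/logs_parquet/',
    partitioned_by = ARRAY['year', 'month']
) AS
SELECT request_id, status, year, month
FROM logs_raw;
"""

athena.start_query_execution(
    QueryString=ctas,
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
    WorkGroup="primary",
)
```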
Kinesis Firehose VS Kinesis Data Streams
- Firehose can deliver streaming data directly to S3
- KDS cannot directly — require intermediate stage
- + quicksight has anomaly insights feature
Amazon Aurora Fault Injection
- Test instance crash → ALTER SYSTEM CRASH
- Test replica failure → ALTER SYSTEM SIMULATE 50 PERCENT READ REPLICA FAILURE
- Test disk failure → ALTER SYSTEM SIMULATE 5 PERCENT DISK FAILURE
- Test disk congestion → ALTER SYSTEM SIMULATE 50 PERCENT DISK CONGESTION
Lake Formation Stuff
- Lake Formation provides access according to user personas in conjunction with IAM permissions.
- Athena supports data anonymization using hashing
Lake Formation Personas
- IAM administrator → superuser, user who can create IAM users and roles. all permissions. CANNOT grant lake formation permissions if not also designated a data lake administrator
- Data Lake Administrator → can register S3 locations, access the data catalog, create databases, create and run workflows, grant LF permissions to others
- readonly admin → view only
- data engineer → create db, create n run crawlers and workflows, grant lake formation permissions on data catalog tables that crawlers and workflows create.
- data analyst → can run queries against data lake using eg athena. can only RUN queries
- workflow → runs a workflow on behalf of user
Cloudwatch Cloudtrail
- cloudtrail trail with cloudwatch logs enabled
- cloudwatch metric filter to select __ API entries in cloudtrail logs
- use metric filter to create cloudwatch alarm
Some Terminology
- Cloudwatch logs → monitor, store, access log files from EC2 instances, Cloudtrail, Route 53 and others
- Cloudwatch metric filter → filter some metrics
- Cloudtrail Insights → analyze normal patterns of API calling, generates insights events when volume/error rate are outside normal patterns
Stuff
DynamoDB projection expressions → a string that identifies the attributes we want. To retrieve a single attribute, specify its name; for multiple attributes, names must be comma separated (example below)
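Small example with a reserved word alias (table and attribute names are made up):

```python
import boto3

ddb = boto3.client("dynamodb")

resp = ddb.get_item(
    TableName="Orders",   # placeholder
    Key={"order_id": {"S": "o-1001"}},
    # Comma-separated attribute names; '#s' aliases the reserved word 'status'
    ProjectionExpression="customer_id, #s, total",
    ExpressionAttributeNames={"#s": "status"},
)
item = resp.get("Item")
```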
Athena Federated Query
- if have data sources other than S3, use Athena Federated Query to query data, or build pipelines then extract to S3
- allows data to be queried in place using SQL
Apache Flink
- open-source, stream-processing, batch-processing framework
- distributed streaming data-flow engine
- executes arbitrary dataflow programs in parallel and pipeline
Continuously Copy SQL Data To Redshift
- Use AWS Database Migration Service (DMS) for database replication
- write transactions to S3 as unpartitioned parquet files
- create partitioned and sorted parquet files using AWS Glue Job
- run crawler to update Glue data catalog
- Load the target table using Amazon Redshift stored procedure
Parquet files
CSV problems
- no defined schema
- lots of escaping eg. stuff with commas
- slow to parse because it's row-based
- files cannot be filtered (no predicate pushdown)
- intermediate use requires redefining schema (lots of work)
Parquet
- columnar data format
- supported in spark, etc
- supports predicate pushdown, perf better
- automatically stores schema info
- binary file format
Working with Parquet
- spark.read.format('parquet').load(filename)
- spark.read.parquet(filename)
- df.write.parquet(filename)
Partitioned vs unpartitioned parquet
- by dividing data into smaller chunks, partitioning can improve query performance + reduce the amount of data scanned (and thus cost); sketch below
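Sketch of writing the same DataFrame both ways in Spark (paths are placeholders, and it assumes the data has year/month/day columns):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("s3://my-bucket/events_raw/")   # placeholder input

# Unpartitioned: one flat pile of parquet files
df.write.mode("overwrite").parquet("s3://my-bucket/events_flat/")

# Partitioned: one prefix per year/month/day, so engines can prune partitions they don't need
(df.write
   .partitionBy("year", "month", "day")
   .mode("overwrite")
   .parquet("s3://my-bucket/events_partitioned/"))
```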
AWS CodeCommit
- source control service that hosts private Git repos.
- eliminates need for us to manage our own source control system, or worry about scaling infra
- like a managed git repo
AWS CodePipeline
- continuous delivery service we can use to model, visualize, automate steps required to release software
- CICD stuff
AWS CodeDeploy
- deployment service enables devs to automate deployment of apps to instances + update apps when required
Creating pipeline to automate deployment of cloudformation template
- create pipeline in AWS CodePipeline using default location for artifact store
- add a source stage, and set provider as primary codecommit repo
- skip build stage
- add deploy stage with CloudFormation as provider, and action as "create or update a stack"
- in output file name, enter “outputs”
- use parameter overrides field to enter documentDB required info
- edit deploy stage in pipeline to include deployment action that will create a change set for previously deployed stack
Tracking db connections and exact queries executed
- enable audit logging using Redshift console, and configure logging on associated S3 bucket. Analyze the logs using Redshift Spectrum
Redshift database audit logging
- connection log — authentications attempts, connections, disconnections
- user log — changes to database user definitions
- user activity log — logs each query before it runs on db
S3 preventing people from deleting stuff
MFA delete — prevent accidental deletes
Object lock — prevent objects from being deleted or overwritten for fixed amount of time. Uses Write-Once-Read-Many (WORM) model
- retention period → specified time period which object remains locked
- legal hold → same protection as retention period, but no expiration date
- object lock works only in buckets with S3 versioning enabled
2 retention modes
- compliance mode → protected object version cannot be overwritten or deleted by any user including root user. retention mode cannot be changed, retention period cannot be shortened
- governance mode → users cant overwrite/delete object version or alter lock settings, unless user has special permission (less secure)
Legal hold → prevent object deletion/overwrite until legal hold is removed
S3 expiring older versions
Noncurrent version expiration
- specify when noncurrent object versions expire
- after expire → permanently delete
- NewerNoncurrentVersions → how many newer noncurrent versions must exist before can delete
- NoncurrentDays → how many days an object is noncurrent before delete
Noncurrent version transition
- specify when to move noncurrent versions to S3 Infrequent Access, Glacier, etc. (sketch below)
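Rough lifecycle rule for noncurrent versions (bucket, prefix and day counts are made up):

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-versioned-bucket",   # placeholder
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-old-versions",
                "Filter": {"Prefix": "logs/"},
                "Status": "Enabled",
                # Move noncurrent versions to Glacier after 30 days...
                "NoncurrentVersionTransitions": [
                    {"NoncurrentDays": 30, "StorageClass": "GLACIER"}
                ],
                # ...and delete them after 365 days, always keeping 3 newer noncurrent versions
                "NoncurrentVersionExpiration": {
                    "NoncurrentDays": 365,
                    "NewerNoncurrentVersions": 3,
                },
            }
        ]
    },
)
```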
More Glue Stuff
dataset in S3, Glue to perform transformations. How to prep data to be ingested from DynamicFrame to redshift cluster?
- use the Relationalize transform to change the data format
- can be used to convert data in a DynamicFrame into a relational (flat) data format
- once relationalized, data can be written to the Redshift cluster (sketch below)
READ MORE: GlueTransform
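Rough Relationalize sketch inside a Glue job (assumes the Glue ETL runtime, which provides awsglue; catalog names and staging path are placeholders):

```python
# Runs inside an AWS Glue job (awsglue is provided by the Glue runtime)
from awsglue.context import GlueContext
from awsglue.transforms import Relationalize
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

dyf = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="orders_json"   # placeholder catalog entries
)

# Flatten nested JSON into a collection of relational DynamicFrames
flat = Relationalize.apply(
    frame=dyf,
    staging_path="s3://my-bucket/tmp/relationalize/",  # placeholder
    name="root",
)
root = flat.select("root")   # the flattened top-level table, ready to write out to Redshift
```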
Real-time Stream Processing Approach
Continuously ingest app log data, process it and write it to data lake + need real-time stream processing approach
- Amazon Managed Streaming for Apache Kafka (MSK) cluster. Create an Amazon Managed Service for Apache Flink app to read from MSK, process the data and write output to the data lake
Transparent Data Encryption
- service available for oracle and SQL server databases
- for oracle, it is integrated with AWS CloudHSM, which is single-tenant hardware security module for AWS cloud
AWS CloudHSM
- combines cloud w security of Hardware Security Modules (HSMs)
- HSM → computing device that processes cryptographic operations, provides secure storage for cryptographic keys
- AWS cloudHSM → control over HSMs on the cloud
- Pay by the hour
Uses cases
- AWS cloudHSM to manage private keys that protect highly confidential data → enterprise security standards
- encrypt n decrypt data → encryption in transit and at rest
- using private key to sign a document allows recipient to VERIFY that I sent that document. Can do this with CloudHSM private/public keys
- Cipher Message Authentication Codes (CMACs) and Hash Based Message Authentication Codes (HMACs) are used to authenticate and ensure integrity of messages sent over unsafe networks. with CloudHSM, can securely create and manage symmetric keys that support CMACs and HMACs
- Combine CloudHSM and KMS to store key material in SINGLE-TENANT environment → CloudHSM key stores
- To send data securely over HTTP, web servers use public/private key pairs and SSL/TLS public key certs to make HTTPS. CloudHSM can reduce computational burden → SSL/TLS offload
- Enable Transparent Data Encryption (TDE) → used to encrypt database files. encrypts data b4 storing on disk.
Amazon DynamoDB Encryption Client
- enables you to protect your data in transit and at rest
- end to end protection for data from sources to destination
Some Notes on AWS Step Functions
- serverless orchestration service where we can combine lambda w other AWS services to build apps through visual workflows
Amazon States Language
- JSON based structured language to define state machine (a collection of states) that can do work (Task states), determine which states to transition to next (Choice states) etc
Types of states
- Pass state → pass input to output without performing work (do nothing)
- Task state → represents a single unit of work performed by state machine. Uses activity OR lambda function to perform work
- Choice state → adds conditional logic to state machine
- Wait State → makes state machine wait for some time
- Success state → stops execution successfully, useful w choice states
- Fail state → stops exec, mark as failure, useful w choice states
- Parallel state → add separate branches of execution
- Map state → run a set of workflow steps for each item in a dataset (minimal example below)
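Tiny Amazon States Language example as a Python dict, just to see the state types fit together (the Lambda ARN is a placeholder):

```python
import json

definition = {
    "Comment": "Minimal example: Task -> Choice -> Succeed/Fail",
    "StartAt": "DoWork",
    "States": {
        "DoWork": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:do-work",  # placeholder
            "Next": "CheckResult",
        },
        "CheckResult": {
            "Type": "Choice",
            "Choices": [
                {"Variable": "$.status", "StringEquals": "OK", "Next": "Done"}
            ],
            "Default": "Failed",
        },
        "Done": {"Type": "Succeed"},
        "Failed": {"Type": "Fail"},
    },
}

# json.dumps(definition) is what you'd paste into the console or pass to create_state_machine
print(json.dumps(definition, indent=2))
```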
Some Notes — Amazon Managed Streaming For Apache Kafka (MSK)
- fully managed service, enables us to build/run apps that use kafka to process streaming data.
- provides control-plane operations eg. create/update/delete clusters
Some Terminology
- producers give data to kafka broker
- consumers consume data from kafka broker
- multiple brokers in case one broker goes down
- messages auto replicated from one broker to another in case one dies
- zookeeper used to manage bunch of brokers
- have multiple zookeepers, each coordinating between different brokers
MSK Serverless
- possible to run Apache Kafka w/o needing to manage and scale cluster capacity — auto provisions and scales capacity for us
More Redshift Notes
- fully managed, petabyte-scale data warehouse service in cloud
- users can analyze data using standard SQL and existing BI tools
- Columnar storage → faster query performance and efficient storage
Node types → Dense compute VS Dense Storage
Updating Lambda functions easily
eg Python script to do stuff, but if Python script change, engineer needs to update multiple lambda functions → need less manual way to do this
- package custom Python script into Lambda layers.
- Apply Lambda layers to the lambda functions
Lambda layers
- a zip file archive that contains supplementary code or data
- usually contain library dependencies, custom runtime or config
Why use lambda layers
- reduce size of deployment packages → instead of including all function dependencies in deployment packages, put them in layer
- separate core function logic from dependencies → update function dependencies independent of function code n vice versa
- share dependencies across multiple functions → after create layer, can apply it to any number of functions in account.
Some ETL Stuff
- Step Functions → Not really for ETL
- Glue studio → ETL, but doesn’t handle data pipeline
- MWAA → can use, but managed service, complex n possibly $$$
- Glue workflows → native Glue orchestration, minimal operational overhead
Amazon Redshift Data API
- suitable for real time queries
- access amazon redshift database using built-in redshift data api → can use lambda, sagemaker notebooks, cloud9
- doesn't require a persistent connection, provides a secure HTTP endpoint and integration with AWS SDKs (sketch below)
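Sketch of a query through the Data API against a serverless workgroup (workgroup, database and SQL are placeholders; provisioned clusters use ClusterIdentifier plus a secret or DB user instead):

```python
import boto3

rsd = boto3.client("redshift-data")

# Submit the query (asynchronous, no persistent connection needed)
resp = rsd.execute_statement(
    WorkgroupName="analytics-wg",   # placeholder serverless workgroup
    Database="dev",
    Sql="SELECT event_date, count(*) FROM events GROUP BY 1 ORDER BY 1;",
)

# Check status (in practice, poll until FINISHED), then fetch results
status = rsd.describe_statement(Id=resp["Id"])
if status["Status"] == "FINISHED":
    rows = rsd.get_statement_result(Id=resp["Id"])["Records"]
```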
Athena permission controls
need permission controls to separate query processes and access to query history among users, teams and apps in the same AWS account
Create Athena workgroup for each use case. Apply tags to workgroup. Create IAM policy that uses tags to apply appropriate permission to WG
Athena Workgroup
- recommend using workgroups to isolate queries for teams, apps or different workloads
- enforce cost constraints eg. per-query limit + per-workgroup limit
- track query-related metrics for all workgroup queries (sketch below)
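Rough sketch of creating a tagged workgroup (name, bucket and byte limit are made up; an IAM policy would then match on the team tag):

```python
import boto3

athena = boto3.client("athena")

athena.create_work_group(
    Name="team-data-eng",   # placeholder
    Configuration={
        "ResultConfiguration": {"OutputLocation": "s3://my-bucket/athena/team-data-eng/"},
        "BytesScannedCutoffPerQuery": 10 * 1024 * 1024 * 1024,  # 10 GB per-query limit
        "PublishCloudWatchMetricsEnabled": True,
    },
    Tags=[{"Key": "team", "Value": "data-eng"}],
)
```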
AWS Glue Jobs Costs
Glue Flex Execution
- Flex → Glue jobs run on spare capacity
- reduce cost of pre-production, test, non-urgent data integration workloads by up to 34%
- good if dont need fast job start times (spare capacity not always avail)
Glue Standard Execution
- use mix of reserved and on-demand instances
- not as cost effective as flex execution
Amazon Resource Name
- uniquely identify AWS resources; require an ARN when we need to specify a resource unambiguously across all of AWS
arn:partition:service:region:account-id:resource-id
arn:partition:service:region:account-id:resource-type/resource-id
arn:partition:service:region:account-id:resource-type:resource-id
Compression in Athena
- snappy → quick compress/decompress rather than size of compress
- LZ4 → also focus on compression speed rather than size
- gzip → default write compression
Amazon Managed Service for Apache Flink
- can use java/scala/python/sql to process and analyze streaming data
- can author/run code against streaming sources n static sources to perform time-series analytics
2 options for running Flink jobs
- managed service for apache flink
- managed service for apache flink studio
Redshift Concurrency Scaling
- supports 1000s of concurrent users and concurrent queries
- when turn on, auto add additional cluster capacity
→ turn on concurrency scaling for workload management (WLM) queue
- supports read/write/copy/insert/delete/update/create table statements
- supports materialized view refresh for MVs w no aggregation
NOTE
- WLM in redshift serverless no concurrency scaling
- WLM in redshift provisioned → have concurrency scaling
Amazon Elastic Map Reduce
- Apache Pig, Oozie, Spark, Hbase, Flink
- process data at petabyte scale quickly and cost-effectively
- ETL → Amazon EMR
3 kinds of Amazon EMR
- Amazon EMR running on Amazon EC2
- Amazon EMR on EKS
- Amazon EMR serverless
Amazon EMR → managed cluster platform that simplifies running big data frameworks eg. hadoop, spark → process vast amounts of data
Nodes in Amazon EMR
- primary node → node that manages cluster, coordinates distribution of data n tasks among other nodes for processing. tracks status of tasks, monitor health of cluster.
- core node → node w software components that run tasks and store data in hadoop distributed file system on your cluster.
- task node → node with software components that only run tasks, does not store data in HDFS (this is optional)
Glue Workflows
- use workflows to create n visualize complex ETL activities involving multiple crawlers, jobs, triggers.
- visual representation of workflow as a graph
- can create from AWS Glue blueprint, OR manually build
Triggers in Glue Workflow
- schedule → workflows starts according to schedule we define
- on demand → we manually start it
- EventBridge event → workflow starts from an event (trigger sketch below)
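Rough sketch of a scheduled trigger that starts a Glue job inside a workflow (names and cron expression are placeholders):

```python
import boto3

glue = boto3.client("glue")

glue.create_trigger(
    Name="nightly-etl-trigger",              # placeholder
    WorkflowName="sales-etl-workflow",       # placeholder
    Type="SCHEDULED",
    Schedule="cron(0 2 * * ? *)",            # 02:00 UTC every day
    Actions=[{"JobName": "sales-etl-job"}],  # placeholder job
    StartOnCreation=True,
)
```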
More ETL Stuff
- multiple ETL workflows that ingest data from db to S3 data lake
- use AWS Glue and Amazon EMR to process data
- company wants to improve the existing architecture to provide automated orchestration that requires minimal manual effort
→ AWS Glue Workflows
Redshift Data Sharing
- securely share access to live data across redshift clusters, workgroups, AWS accounts, AWS regions
- no need to manually move or copy data
We can share data at different levels in Amazon redshift
- database, schema, table, view levels etc
Redshift Spectrum
- we can query and retrieve structured and semistructured data from S3
- use massive parallelism to run very fast against large datasets
AWS Graviton
- graviton processors are custom-built by AWS to deliver best price performance for cloud workloads
- offer up to 40% better price performance
S3 partitions
- large orgs have lots of data → usually stored in S3
Partitioning → lay out data optimally for distributed analytics engines
- restrict amount of data scanned → improve performance + reduce cost
- eg if data is partitioned by day, each day's data lands under its own prefix (folder)
s3://mybucket/logs/year=x/month=6/day=1/test.json
s3://mybucket/logs/year=x/month=6/day=2/test.json
s3://mybucket/logs/year=x/month=6/day=3/test.json
s3://mybucket/logs/year=x/month=6/day=4/test.json
SELECT * FROM table WHERE year=x AND month=6 AND day=1
→ reads only data inside the partition folder year=x/month=6/day=1
→ only scans this
- systems like Athena, Redshift Spectrum and Glue can use these partitions to filter data by value → improve performance
AWS Glue Partition Index
- over time, many many partitions are added to a table
- when create partition index, we specify list of partition keys that already exist on a given table
- partition index can be created on permutation of partition keys defined on table eg. (country, category, year)
- data catalog will concatenate partition values in order
Athena Partitioning and bucketing
- mainly to reduce amount of data athena needs to scan when querying
Partitioning
- organizing data into directories (or prefixes) on S3 based on particular properties of data
- such properties are known as partition keys eg date, year, month
- good candidates for partition keys → properties that are frequently used in queries and have low cardinality
Bucketing
- way of organizing records of dataset into categories called buckets
- records with same value for a property go into same bucket
- records are distributed as evenly as possible among buckets
Conclusion
I hope this was helpful in some way
If You Wish To Support Me As A Creator
- Clap 50 times for this story
- Leave a comment telling me your thoughts
- Highlight your favourite part of the story
Thank you! These tiny actions go a long way, and I really appreciate it!
YouTube: https://www.youtube.com/@zlliu246
LinkedIn: https://www.linkedin.com/in/zlliu/
My Ebooks: https://zlliu.co/ebooks