AWS snippets
Command line from a notebook
# aws s3 ls [bucket_name]
aws s3 ls s3://my-bucket/my-folder
# Copy file from s3 to local storage
# aws s3 cp [bucket_name/file_name] [file_name]
aws s3 cp s3://my-bucket/my-folder/my-file.csv ./my-file.csv
# show the first few lines of a text file
head -n 4 ./my-file.csv
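Inside a Jupyter/SageMaker notebook cell, the same commands need a leading ! (the IPython shell escape); same bucket and file names as above:
# Run the same commands from a notebook cell
!aws s3 ls s3://my-bucket/my-folder
!aws s3 cp s3://my-bucket/my-folder/my-file.csv ./my-file.csv
!head -n 4 ./my-file.csv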
Get AWS session, region, bucket, and account_id with boto3 and sagemaker
import boto3
import sagemaker
sess = sagemaker.Session()
# S3 bucket name
bucket = sess.default_bucket()
# AWS region
region = boto3.Session().region_name
# Account ID
sts = boto3.Session(region_name=region).client(service_name="sts")
account_id = sts.get_caller_identity()['Account']
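A typical next step, shown here only as a minimal sketch (the local file name and key prefix are placeholders), is to upload a local file to the default bucket with the same SageMaker session:
# Upload a local file to the default bucket (path and key_prefix are placeholders)
s3_uri = sess.upload_data(path='./my-file.csv', bucket=bucket, key_prefix='data/raw')
print(region, account_id, s3_uri)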
awswrangler
The first step is to create an AWS Glue Catalog database:
import awswrangler as wr
wr.catalog.create_database(
    name='...',    # AWS Glue Catalog database name, e.g. my_db_deep_learning
    exist_ok=True  # do not fail if the database already exists
)
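Optionally, the existing Glue Catalog databases can be listed as a pandas DataFrame to confirm the creation (assumes a reasonably recent awswrangler version):
# Sanity check: list the Glue Catalog databases
print(wr.catalog.databases())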
We can now register a specific .csv file in the Glue Catalog:
res = wr.catalog.create_csv_table(
    database='...',  # AWS Glue Catalog database name
    path='s3://{}/data/transformed/'.format(bucket),  # S3 object path for the data
    table='...',  # table name, e.g. my_table
    columns_types={
        'ids': 'int',
        'my_feature': 'string',
        'my_target': 'int'
    },
    mode='overwrite',          # replace the table definition if it already exists
    skip_header_line_count=1,  # the .csv files start with a header row
    sep=','
)
The table is now visible in the Glue section of the AWS console.
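The same check can be done programmatically (the database name below is just the example name used above):
# List the tables registered in the example database
print(wr.catalog.tables(database='my_db_deep_learning'))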
AWS Athena
AWS Athena allows running SQL queries directly against data stored in an S3 bucket. Athena needs a bucket of its own to store query results and temporary files (these also speed up reading the results of subsequent queries); awswrangler can create its default Athena bucket:
wr.athena.create_athena_bucket()
We can now run any SQL query and load the result into a pandas DataFrame:
my_query = """
SELECT my_target, COUNT(my_target) AS my_target_count
FROM my_table
GROUP BY my_target
ORDER BY my_target_count
"""
df = wr.athena.read_sql_query(
    sql=my_query,
    database='...'  # AWS Glue Catalog database name, e.g. my_db_deep_learning
)
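The result df is a regular pandas DataFrame. If the default CTAS-based reader is not desired (for example, missing permissions to create temporary tables), recent awswrangler versions accept a ctas_approach flag; a sketch with the example database name:
# Sketch: read the results without the temporary CTAS table (assumes a recent awswrangler version)
df = wr.athena.read_sql_query(
    sql=my_query,
    database='my_db_deep_learning',  # example database name from above
    ctas_approach=False
)
print(df.head())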