Amazon S3: The Foundation of Modern Data Lakes

In the evolving landscape of big data, organizations face the challenge of storing, managing, and analyzing vast amounts of information efficiently and cost-effectively. Amazon Simple Storage Service (S3) has emerged as the cornerstone technology for building scalable data lakes—centralized repositories that allow you to store all your structured and unstructured data at any scale.

The Evolution of Data Storage

Traditional data storage approaches faced significant limitations when confronted with the volume, variety, and velocity of modern data. Relational databases required rigid schemas, making them ill-suited for diverse data types. On-premises storage solutions demanded large upfront investments and couldn’t easily scale with growing data needs.

Enter Amazon S3, introduced in 2006 as one of AWS’s first services. What began as a simple object storage offering has evolved into a sophisticated platform that powers everything from simple backup solutions to complex enterprise data lakes serving petabytes of information to thousands of applications and users.

What Makes S3 Unique for Data Lakes?

S3’s architecture provides several key advantages that make it ideal for data lake implementations:

Virtually Unlimited Scalability

S3 can store virtually unlimited amounts of data without degradation in performance. Individual objects can range from zero bytes up to 5TB in size, and a single bucket can contain trillions of objects. This scalability eliminates the need for capacity planning and allows data lakes to grow organically with business needs.

┌─────────────────────────────────────────────────────────┐
│                      Amazon S3                          │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  • Unlimited storage capacity                           │
│  • Individual objects up to 5TB                         │
│  • Trillions of objects per bucket                      │
│  • 99.999999999% durability (11 nines)                  │
│  • 99.99% availability                                  │
│                                                         │
└─────────────────────────────────────────────────────────┘

Cost-Effective Tiered Storage

S3 offers multiple storage classes optimized for different use cases:

  • S3 Standard: For frequently accessed data
  • S3 Intelligent-Tiering: Automatically moves objects between access tiers
  • S3 Standard-IA and S3 One Zone-IA: For infrequently accessed data
  • S3 Glacier and S3 Glacier Deep Archive: For long-term archival

This tiered approach allows organizations to optimize costs based on access patterns while maintaining a single management interface.
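
To make the tiering model concrete, here is a minimal boto3 sketch that writes an object directly into the Standard-IA class; the bucket and key names are illustrative, not part of any real configuration.

# Minimal sketch: uploading straight into S3 Standard-IA (names are illustrative)
import boto3

s3 = boto3.client('s3')

with open('sales_data.parquet', 'rb') as f:
    s3.put_object(
        Bucket='company-data-lake',
        Key='raw/sales/2023/01/01/sales_data.parquet',
        Body=f,
        StorageClass='STANDARD_IA'  # skip the Standard tier for data that arrives cold
    )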

Built-in Data Protection

S3 provides multiple layers of protection:

  1. Versioning: Preserves multiple variants of an object, allowing recovery from accidental deletions or overwrites (a minimal sketch follows this list)
  2. Replication: Cross-region and same-region replication for disaster recovery and compliance
  3. Object Lock: Write-once-read-many (WORM) protection for regulatory requirements
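
As a small illustration of the first layer, the following boto3 sketch enables versioning on a hypothetical data lake bucket; once enabled, overwrites and deletes become recoverable.

# Minimal sketch: enabling versioning on a hypothetical bucket
import boto3

s3 = boto3.client('s3')

s3.put_bucket_versioning(
    Bucket='company-data-lake',
    VersioningConfiguration={'Status': 'Enabled'}
)

# Prior versions stay listable and restorable after overwrites or deletes.
versions = s3.list_object_versions(Bucket='company-data-lake', Prefix='raw/sales/')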

Rich Metadata and Querying Capabilities

S3 allows custom metadata to be attached to objects, enabling sophisticated organization and retrieval patterns. Services like S3 Select and Amazon Athena provide SQL-like query capabilities directly on data stored in S3, eliminating the need to move data to specialized analytics platforms.

Technical Architecture of S3-Based Data Lakes

A well-designed S3 data lake consists of several key components:

Storage Organization

The foundation of an effective data lake is a thoughtful organization structure:

s3://company-data-lake/
├── raw/                  # Raw data as ingested
│   ├── sales/
│   ├── marketing/
│   └── operations/
├── stage/                # Cleansed and validated data
│   ├── sales/
│   │   ├── year=2023/
│   │   │   ├── month=01/
│   │   │   │   ├── day=01/
│   │   │   │   │   └── sales_data.parquet
│   │   │   │   └── day=02/
│   │   │   └── month=02/
│   │   └── year=2022/
│   └── marketing/
└── analytics/            # Processed data optimized for analytics
    ├── sales_by_region/
    ├── customer_360/
    └── product_performance/

This organization employs several best practices:

  1. Multi-stage approach: Separating raw, intermediate, and analytics-ready data
  2. Partitioning: Using path patterns that align with common query patterns (see the key-building sketch after this list)
  3. Domain separation: Organizing data by business domain
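
A short sketch of how an ingestion job might build keys for this layout; the helper function and bucket name are hypothetical, but the year=/month=/day= pattern matches the tree above and is what engines like Athena and Spark use for partition pruning.

# Illustrative sketch: building partition-aligned keys for the layout above
from datetime import date

import boto3

s3 = boto3.client('s3')

def stage_key(domain: str, d: date, filename: str) -> str:
    # Hive-style year=/month=/day= segments mirror the stage/ tree shown above
    return f"stage/{domain}/year={d:%Y}/month={d:%m}/day={d:%d}/{filename}"

key = stage_key('sales', date(2023, 1, 1), 'sales_data.parquet')
s3.upload_file('sales_data.parquet', 'company-data-lake', key)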

Data Formats and Compression

S3-based data lakes typically leverage optimized file formats:

  • Parquet: Columnar format ideal for analytical queries
  • ORC: Optimized Row Columnar format for Hadoop ecosystems
  • Avro: Row-based format with strong schema-evolution support
  • JSON & CSV: For interoperability with external systems

Compression codecs like Snappy, GZIP, or ZSTD further reduce storage costs and improve query performance.
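
For a sense of how format and codec choices come together, here is a hedged pyarrow sketch writing the same table with two different codecs; the column names are invented for illustration.

# Hedged sketch: writing Parquet with Snappy vs. ZSTD compression (pyarrow)
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    'transaction_id': ['t-001', 't-002'],
    'amount': [125.50, 89.99],
})

# Snappy favors speed; ZSTD typically compresses tighter at some CPU cost.
pq.write_table(table, 'sales_snappy.parquet', compression='snappy')
pq.write_table(table, 'sales_zstd.parquet', compression='zstd')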

Metadata Management

Two approaches to metadata management are common:

  1. AWS Glue Data Catalog: Centralized repository of table definitions and schema information
  2. Custom metadata solutions: Third-party or home-grown catalog systems
# Defining a table in AWS Glue Data Catalog
import boto3

glue_client = boto3.client('glue')

response = glue_client.create_table(
    DatabaseName='sales_database',
    TableInput={
        'Name': 'transactions',
        'StorageDescriptor': {
            'Columns': [
                {'Name': 'transaction_id', 'Type': 'string'},
                {'Name': 'customer_id', 'Type': 'string'},
                {'Name': 'amount', 'Type': 'double'},
                {'Name': 'transaction_date', 'Type': 'timestamp'}
            ],
            'Location': 's3://company-data-lake/stage/sales/',
            'InputFormat': 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat',
            'OutputFormat': 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat',
            'SerdeInfo': {
                'SerializationLibrary': 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
            }
        },
        'PartitionKeys': [
            {'Name': 'year', 'Type': 'string'},
            {'Name': 'month', 'Type': 'string'},
            {'Name': 'day', 'Type': 'string'}
        ]
    }
)

Access Patterns

Multiple access patterns can be employed against S3 data lakes:

  1. Direct API access: Applications using the S3 API
  2. SQL queries: Using Athena, Redshift Spectrum, or EMR (an Athena sketch follows this list)
  3. Spark processing: Via EMR, Glue, or third-party Spark implementations
  4. Specialized analytics: Using services like QuickSight or SageMaker
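
As one example of the SQL path, the sketch below runs an Athena query against the transactions table defined in the Glue snippet earlier; the results bucket is an assumption, and in practice you would poll for completion before reading results.

# Hedged sketch: querying the lake with Athena (results bucket is assumed)
import boto3

athena = boto3.client('athena')

response = athena.start_query_execution(
    QueryString=(
        "SELECT customer_id, SUM(amount) AS total "
        "FROM sales_database.transactions "
        "WHERE year = '2023' GROUP BY customer_id"
    ),
    QueryExecutionContext={'Database': 'sales_database'},
    ResultConfiguration={'OutputLocation': 's3://company-query-results/'}
)

query_id = response['QueryExecutionId']
# Poll get_query_execution(QueryExecutionId=query_id) until SUCCEEDED,
# then fetch rows with get_query_results(QueryExecutionId=query_id).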

Implementing S3 Data Lakes: Best Practices

Security Implementation

A comprehensive security approach includes:

  • IAM policies: Fine-grained access control
  • Bucket policies: Bucket-level permissions
  • Access Control Lists: Object-level permissions
  • S3 Block Public Access: Preventing accidental exposure
  • Encryption: Server-side (SSE-S3, SSE-KMS, SSE-C) and client-side options
// Example bucket policy enforcing encryption
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyIncorrectEncryptionHeader",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::company-data-lake/*",
      "Condition": {
        "StringNotEquals": {
          "s3:x-amz-server-side-encryption": "aws:kms"
        }
      }
    },
    {
      "Sid": "DenyUnencryptedObjectUploads",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::company-data-lake/*",
      "Condition": {
        "Null": {
          "s3:x-amz-server-side-encryption": "true"
        }
      }
    }
  ]
}

Data Lifecycle Management

Effective lifecycle management reduces costs while maintaining performance:

  • Transition rules: Moving objects between storage classes
  • Expiration rules: Deleting obsolete data
  • Intelligent-Tiering: Automating storage class selection
// Lifecycle configuration example
{
  "Rules": [
    {
      "ID": "Move to IA after 30 days, archive after 90, delete after 7 years",
      "Status": "Enabled",
      "Filter": {
        "Prefix": "raw/"
      },
      "Transitions": [
        {
          "Days": 30,
          "StorageClass": "STANDARD_IA"
        },
        {
          "Days": 90,
          "StorageClass": "GLACIER"
        }
      ],
      "Expiration": {
        "Days": 2555
      }
    }
  ]
}

Performance Optimization

Several techniques ensure optimal performance:

  1. Request parallelization: Distributing requests across multiple connections (illustrated below)
  2. Partitioning strategy: Aligning with query patterns to minimize scanned data
  3. Prefix optimization: Distributing objects across multiple prefixes for high-throughput scenarios
  4. Compression settings: Balancing between storage savings and processing overhead
  5. S3 Transfer Acceleration: For uploading data from distant locations
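
The first technique is largely handled for you by boto3's transfer manager; the sketch below tunes it for concurrent multipart transfers, with illustrative thresholds and key names.

# Hedged sketch: parallel multipart transfers via boto3's transfer manager
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client('s3')

# Split large objects into 64 MB parts moved over 10 concurrent connections.
config = TransferConfig(
    multipart_threshold=64 * 1024 * 1024,
    multipart_chunksize=64 * 1024 * 1024,
    max_concurrency=10,
)

s3.download_file(
    'company-data-lake',
    'analytics/sales_by_region/part-00000.parquet',  # illustrative key
    '/tmp/part-00000.parquet',
    Config=config,
)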

Cost Management

Controlling costs in S3 data lakes involves:

  • Storage class selection: Matching storage classes to access patterns
  • Lifecycle policies: Automating transitions to lower-cost tiers
  • Data compression: Reducing overall storage volume
  • S3 Analytics: Identifying cost optimization opportunities
  • Request optimization: Minimizing LIST operations on large buckets

Real-World Data Lake Architectures

Log Analytics Platform

┌───────────────┐     ┌───────────────┐     ┌───────────────┐
│ Application   │     │  Kinesis Data │     │ Amazon S3     │
│ Logs          │────▶│  Firehose     │────▶│ (Raw Zone)    │
└───────────────┘     └───────────────┘     └───────┬───────┘
                                                    │
                                                    ▼
┌───────────────┐     ┌───────────────┐     ┌───────────────┐
│ Amazon        │     │ Amazon EMR    │     │ Amazon S3     │
│ Athena        │◀────│ (Spark)       │◀────│ (Processed)   │
└───────────────┘     └───────────────┘     └───────────────┘

This architecture enables:

  • Real-time log collection from thousands of sources
  • Cost-effective storage of petabytes of log data
  • On-demand analysis without pre-provisioning resources

Customer 360 Platform

┌───────────────┐     ┌───────────────┐     ┌───────────────┐
│ CRM Data      │     │               │     │               │
│ Sales Data    │────▶│  AWS Glue     │────▶│  Amazon S3    │
│ Support Data  │     │  ETL Jobs     │     │  Data Lake    │
└───────────────┘     └───────────────┘     └───────┬───────┘
                                                    │
                                                    ▼
┌───────────────┐     ┌───────────────┐     ┌───────────────┐
│ Amazon        │     │ Amazon        │     │ Amazon        │
│ QuickSight    │◀────│ Redshift      │◀────│ SageMaker     │
└───────────────┘     └───────────────┘     └───────────────┘

Benefits include:

  • Unified view of customer interactions across channels
  • Scalable machine learning to predict customer behavior
  • Self-service analytics for business users

IoT Data Platform

┌───────────────┐     ┌───────────────┐     ┌───────────────┐
│ IoT Devices   │     │ AWS IoT Core  │     │ Amazon S3     │
│ (Sensors)     │────▶│               │────▶│ Raw Zone      │
└───────────────┘     └───────────────┘     └───────┬───────┘
                                                    │
                                                    ▼
┌───────────────┐     ┌───────────────┐     ┌───────────────┐
│ Amazon        │     │ AWS Lambda    │     │ Amazon S3     │
│ Timestream    │◀────│ Functions     │◀────│ Processed     │
└───────────────┘     └───────────────┘     └───────────────┘

This approach delivers:

  • Scalable ingestion of time-series data from millions of devices
  • Tiered storage strategy optimized for both real-time and historical analysis
  • Cost-effective long-term retention of device data

Advanced S3 Features for Data Lakes

S3 Select and Glacier Select

These features enable server-side filtering of data, reducing the amount of data transferred and processed:

# Using S3 Select to filter data
import boto3

s3_client = boto3.client('s3')

response = s3_client.select_object_content(
    Bucket='company-data-lake',
    Key='raw/sales/2023/01/01/transactions.csv',
    ExpressionType='SQL',
    Expression="SELECT s.customer_id, s.amount FROM S3Object s WHERE s.amount > 100",
    InputSerialization={'CSV': {'FileHeaderInfo': 'Use'}},
    OutputSerialization={'CSV': {}}
)

S3 Access Points

Access points simplify managing access to shared datasets:

// Access point policy for analytics team
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::123456789012:role/AnalyticsTeamRole"
      },
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:us-east-1:123456789012:accesspoint/analytics-access/object/*",
        "arn:aws:s3:us-east-1:123456789012:accesspoint/analytics-access"
      ]
    }
  ]
}

S3 Object Lambda

S3 Object Lambda allows you to add custom code to GET, LIST, and HEAD requests to modify and process data as it is retrieved:

# Lambda function to redact sensitive information
import re

import boto3
import requests

def lambda_handler(event, context):
    object_get_context = event["getObjectContext"]
    request_route = object_get_context["outputRoute"]
    request_token = object_get_context["outputToken"]
    s3_url = object_get_context["inputS3Url"]

    # Get object from S3
    response = requests.get(s3_url)
    original_object = response.content.decode('utf-8')
    
    # Apply transformation (redact credit card numbers)
    transformed_object = re.sub(r'\b(?:\d{4}[ -]?){3}\d{4}\b', 'XXXX-XXXX-XXXX-XXXX', original_object)
    
    # Write back to S3 Object Lambda
    s3 = boto3.client('s3')
    s3.write_get_object_response(
        Body=transformed_object,
        RequestRoute=request_route,
        RequestToken=request_token)
    
    return {'status_code': 200}

S3 Batch Operations

For large-scale changes across many objects:

# Creating a batch job to apply object tags
import boto3

s3_control_client = boto3.client('s3control')

response = s3_control_client.create_job(
    AccountId='123456789012',
    Operation={
        'S3PutObjectTagging': {
            'TagSet': [
                {
                    'Key': 'data-classification',
                    'Value': 'confidential'
                }
            ]
        }
    },
    Report={
        'Bucket': 'company-job-reports',
        'Format': 'Report_CSV_20180820',
        'Enabled': True,
        'Prefix': 'batch-tagging-job',
        'ReportScope': 'AllTasks'
    },
    Manifest={
        'Spec': {
            'Format': 'S3BatchOperations_CSV_20180820',
            'Fields': ['Bucket', 'Key']
        },
        'Location': {
            'ObjectArn': 'arn:aws:s3:::company-manifests/confidential-files.csv',
            'ETag': 'etagvalue'
        }
    },
    Priority=10,
    RoleArn='arn:aws:iam::123456789012:role/BatchOperationsRole',
    ClientRequestToken='a1b2c3d4-5678-90ab-cdef'
)

The Future of S3-Based Data Lakes

Several emerging trends are shaping the evolution of S3-based data lakes:

Data Mesh Architecture

The data mesh paradigm distributes ownership of domains to teams closest to the data, with S3 providing the flexible foundation for this approach:

┌───────────────────────────────────────────┐
│           S3-based Data Mesh              │
├───────────┬───────────┬───────────────────┤
│ Marketing │ Sales     │ Operations        │
│ Domain    │ Domain    │ Domain            │
│           │           │                   │
│ s3://mkt/ │ s3://sls/ │ s3://ops/         │
└───────────┴───────────┴───────────────────┘
           ▲           ▲           ▲
           │           │           │
┌──────────┴───────────┴───────────┴────────┐
│        Cross-Domain Governance Layer       │
└───────────────────────────────────────────┘
           ▲           ▲           ▲
           │           │           │
┌──────────┴───────────┴───────────┴────────┐
│        Self-Service Analytics Layer        │
└───────────────────────────────────────────┘

Lakehouse Architectures

The lakehouse pattern combines the best features of data lakes and data warehouses:

  • S3 for raw storage
  • Table formats like Apache Iceberg, Delta Lake, or Apache Hudi (a Delta Lake sketch follows this list)
  • ACID transactions on data lake storage
  • Performance optimizations like indexing and caching
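
As one concrete route to ACID tables on S3, here is a hedged sketch using the open-source deltalake (delta-rs) Python package; the table path is hypothetical, and concurrent writers on S3 may additionally need a locking provider configured via storage_options.

# Hedged sketch: ACID appends to an S3-backed Delta table (deltalake package)
import pandas as pd
from deltalake import DeltaTable, write_deltalake

df = pd.DataFrame({'transaction_id': ['t-001'], 'amount': [125.50]})

# Each write commits atomically to the table's transaction log.
write_deltalake('s3://company-data-lake/analytics/transactions_delta', df, mode='append')

dt = DeltaTable('s3://company-data-lake/analytics/transactions_delta')
print(dt.version())  # monotonically increasing commit version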

Real-Time Data Lakes

Increasingly, data lakes are supporting real-time or near-real-time workloads:

  • Streaming ingestion via Kinesis Data Streams or MSK (sketched after this list)
  • Change data capture (CDC) pipelines
  • Incremental processing frameworks
  • Real-time query engines operating directly on S3
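
A minimal sketch of the ingestion edge, assuming a Kinesis stream named lake-ingest whose consumer (for example, a Firehose delivery stream) lands records in S3:

# Minimal sketch: pushing an event onto a Kinesis stream feeding the lake
import json

import boto3

kinesis = boto3.client('kinesis')

event = {'sensor_id': 'dev-42', 'temperature': 21.7}

kinesis.put_record(
    StreamName='lake-ingest',             # assumed stream name
    Data=json.dumps(event).encode(),      # downstream delivery lands it in S3
    PartitionKey=event['sensor_id'],
)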

Conclusion

Amazon S3 has fundamentally transformed how organizations approach data storage and analytics. By providing a scalable, durable, and cost-effective foundation for data lakes, S3 enables businesses of all sizes to harness the full value of their data assets.

The flexible nature of S3 storage, combined with AWS’s rich ecosystem of analytics services, creates a powerful platform that can adapt to evolving business needs. From startups just beginning their data journey to enterprises managing petabytes of information, S3-based data lakes provide the infrastructure needed to drive insights and innovation.

As data continues to grow in volume and importance, the role of S3 as the bedrock of modern data architecture will only become more critical. Organizations that master the capabilities of S3 data lakes position themselves to unlock the full potential of their data in the age of AI and advanced analytics.


Hashtags: #AmazonS3 #DataLakes #CloudStorage #BigData #AWS #DataArchitecture #ObjectStorage #DataEngineering #CloudComputing #Analytics
