Glossary¶
Data Management¶
Async¶
Async (asynchronous) refers to a programming approach that allows tasks or operations to be executed without blocking the main program or process. Asynchronous code is particularly useful when a program needs to handle long-running or potentially slow operations (e.g., network requests, I/O operations) without pausing the execution of other parts of the program.
Instead of waiting for an operation to complete, the program continues running, and the result of the asynchronous task is handled later when it's ready. This enhances the performance and responsiveness of programs, especially in multitasking or high-load environments.
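As a minimal sketch using Python's asyncio, two simulated slow operations (the delays stand in for real network requests) run concurrently instead of blocking each other:

```python
import asyncio

async def fetch(name: str, delay: float) -> str:
    # Simulate a slow I/O operation (e.g., a network request) without blocking
    await asyncio.sleep(delay)
    return f"{name} finished after {delay}s"

async def main() -> None:
    # Both "requests" run concurrently; the total time is ~2s rather than 3s
    results = await asyncio.gather(
        fetch("request-1", 1.0),
        fetch("request-2", 2.0),
    )
    for result in results:
        print(result)

if __name__ == "__main__":
    asyncio.run(main())
```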
CRUD¶
CRUD is a set of basic operations used for working with data in a database:
- Create: The operation that allows adding new records to the database. It’s the first stage of interaction with data, where new information is introduced into the system.
- Read: The operation of retrieving data from the database. It provides access to existing records for use, viewing, or analysis.
- Update: The operation that modifies existing data. It is applied when there is a need to change information in the database.
- Delete: The operation of removing data from the database. It is used to delete unnecessary or outdated records from the system.
These four operations cover the full data management cycle, from creation to final deletion, and are essential for working with any type of database.
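For illustration, the four operations map directly onto SQL statements; this minimal sketch uses Python's built-in sqlite3 module with an in-memory database and a hypothetical users table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database for the example
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")

# Create: add a new record
conn.execute("INSERT INTO users (name, email) VALUES (?, ?)", ("Alice", "alice@example.com"))

# Read: retrieve existing records
print(conn.execute("SELECT id, name, email FROM users WHERE name = ?", ("Alice",)).fetchone())

# Update: modify existing data
conn.execute("UPDATE users SET email = ? WHERE name = ?", ("alice@new.example.com", "Alice"))

# Delete: remove the record
conn.execute("DELETE FROM users WHERE name = ?", ("Alice",))

conn.commit()
conn.close()
```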
Data Ingestion¶
Data ingestion is the overall process of collecting, transferring, and loading data from one or multiple sources so that it may be analyzed immediately or stored in a database for later use. Data may be entered into a database, data warehouse, data repository or application.
Data can be streamed in real time or ingested in batches. When data is ingested in real time, each data item is imported as it is emitted by the source. When data is ingested in batches, data items are imported in discrete chunks at periodic intervals of time.
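The two modes can be sketched in plain Python; the source, the print statements standing in for a real sink, and the batch size are hypothetical placeholders rather than any specific tool's API:

```python
from itertools import islice
from typing import Iterable, Iterator

def source() -> Iterator[dict]:
    # Hypothetical source emitting one record at a time (e.g., sensor events)
    for i in range(10):
        yield {"id": i, "value": i * i}

def ingest_streaming(records: Iterable[dict]) -> None:
    # Real-time ingestion: handle each item as soon as it is emitted
    for record in records:
        print("stream ->", record)

def ingest_batches(records: Iterable[dict], batch_size: int = 4) -> None:
    # Batch ingestion: import items in discrete chunks
    iterator = iter(records)
    while batch := list(islice(iterator, batch_size)):
        print("batch ->", batch)

ingest_streaming(source())
ingest_batches(source())
```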
It can be challenging for businesses to ingest data at a reasonable speed. When data ingestion is automated, the software used to carry out the process may also include data preparation features that structure and organize data so it can be analyzed on the fly or at a later time by business intelligence (BI) and business analytics (BA) programs.
Data ingestion tools provide a framework that allows companies to collect, import, load, transfer, integrate, and process data from a wide range of data sources.
Data Integration¶
Data integration supports the analytical processing of large data sets by aligning, combining, and presenting each data set from organizational departments and external remote sources to fulfill integrator objectives. For example, a user's complete data set may include data extracted from marketing, sales, and operations and combined in such a way that it yields consistent, comprehensive, current, and correct information for business reporting and analysis. The source systems may be various types of devices, and the data may be in a variety of formats.
Data integration is the combination of technical and business processes used to combine data from disparate sources into meaningful and valuable information. A complete data integration solution delivers trusted data from various sources. It combines data from multiple separate business systems into a single unified view, often called a single source of truth. This unified view is typically stored in a central data repository known as a data warehouse. Data integration is often a prerequisite to other processes, including analysis, reporting, and forecasting.
Data Lake¶
A data lake is a storage repository that holds a vast amount of raw data in its native format, including structured, semistructured, and unstructured data. The data structure and requirements are not defined until the data is needed. A data lake is usually a single store of all enterprise data, including raw copies of source system data and transformed data used for tasks such as reporting, visualization, analytics, and machine learning. A data lake can include structured data from relational databases (rows and columns), semistructured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs) and binary data (images, audio, video). You can store your data as-is, without having to first structure the data, and run different types of analytics — from dashboards and visualizations to big data processing, real-time analytics, and machine learning — to guide better decisions.
How does a data lake differ from a data warehouse?
A data warehouse is a database optimized to analyze relational data coming from transactional systems and line-of-business applications. The data structure and schema are defined in advance to optimize for fast SQL queries, where the results are typically used for operational reporting and analysis. Data is cleaned, enriched, and transformed so it can act as the single source of truth that users can trust.
A data lake stores relational data from line-of-business applications and non-relational data from mobile apps, IoT devices, and social media. The structure of the data or schema is not defined when data is captured. This means you can store all of your data without careful design or the need to know what questions you might need answers for in the future. Different types of analytics on your data like SQL queries, big data analytics, full text search, real-time analytics, and machine learning can be used to uncover insights.
Data Migration¶
Data migration is the process of selecting, preparing, extracting, and transforming data and permanently transferring it from one computer storage system to another.
Whereas data integration involves collecting data from sources outside of an organization for analysis, migration refers to the movement of data already stored internally to different systems. Companies will typically migrate data when implementing a new system or merging to a new environment. Migration techniques are often performed by a set of programs or automated scripts that automatically transfer data. Additionally, the validation of migrated data for completeness and the decommissioning of legacy data storage are considered part of the entire data migration process.
However, ‘transfer’ is not the only aspect of data migration methodology. If the data is diverse, the migration process includes mappings and transformations between source and target data. Above all, data quality must be assessed before migration to ensure a successful implementation. The success rate of any data migration project is directly dependent on the diversity, volume, and quality of data being transferred.
Most migrations take place through five major stages:
- Extraction: remove data from the current system to begin working on it.
- Transformation: match data to its new form and ensure that metadata reflects the data in each field.
- Cleansing: deduplicate, run tests, and address any corrupted data.
- Validation: test and retest that moving the data to the target location provides the expected response.
- Loading: transfer data into the new system and review for errors again.
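A highly simplified sketch of the five stages above, using hypothetical in-memory "systems" and field mappings in place of real storage:

```python
# Hypothetical legacy records standing in for the source system
legacy = [
    {"ID": "1", "FullName": "Alice", "Mail": "alice@example.com"},
    {"ID": "1", "FullName": "Alice", "Mail": "alice@example.com"},  # duplicate
    {"ID": "2", "FullName": "Bob", "Mail": None},                   # missing value
]

def extract(records):
    # Extraction: pull data out of the current system
    return list(records)

def transform(records):
    # Transformation: map fields onto the target schema
    mapping = {"ID": "id", "FullName": "name", "Mail": "email"}
    return [{mapping[key]: value for key, value in record.items()} for record in records]

def cleanse(records):
    # Cleansing: deduplicate and drop records with missing required fields
    seen, clean = set(), []
    for record in records:
        key = (record["id"], record["email"])
        if record["email"] is not None and key not in seen:
            seen.add(key)
            clean.append(record)
    return clean

def validate(records):
    # Validation: check that the result matches expectations before loading
    assert all(set(record) == {"id", "name", "email"} for record in records)
    return records

target = []  # stands in for the new system
target.extend(validate(cleanse(transform(extract(legacy)))))  # Loading
print(target)
```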
Data pipeline¶
A data pipeline is a set of actions that extracts data from various sources and delivers it for storage, analysis, or visualization. It is an automated process: take these columns from this database, merge them with these columns from this API, subset rows according to a value, substitute NAs with the median, and load them into this other database.
The purpose of a data pipeline is to move data from a point of origin to a specific destination. At a high level, a data pipeline consists of eight types of components:
- Origin – The initial point at which data enters the pipeline.
- Destination – The termination point to which data is delivered.
- Dataflow – The sequence of processes and data stores through which data moves to get from origin to destination.
- Storage – The datasets where data is persisted at various stages as it moves through the pipeline.
- Processing – The steps and activities that are performed to ingest, persist, transform, and deliver data.
- Workflow – Sequencing and dependency management of processes.
- Monitoring – Observing to ensure a healthy and efficient pipeline.
- Technology – The infrastructure and tools that enable dataflow, storage, processing, workflow, and monitoring.
Data warehouse¶
A data warehouse (DW) is a system used for reporting and data analysis, and is considered a core component of business intelligence. DWs are central repositories of integrated data from one or more disparate sources. They store current and historical data in one single place (Wikipedia). Data warehousing is used to provide greater insight into the performance of a company by comparing data consolidated from multiple heterogeneous sources.
Data flows into a data warehouse from transactional systems, relational databases, and other sources, typically on a regular cadence. Business analysts, data scientists, and decision-makers access the data through business intelligence (BI) tools, SQL clients, and other analytics applications. Data warehouses can also feed data marts, which are subsets of data warehouses oriented for specific business lines, such as sales or finance. A data warehouse typically combines information from several data marts in multiple business lines, but a data mart contains data from a set of source systems related to one business line.
A data warehouse is designed to support business decisions by allowing data consolidation, analysis, and reporting at different aggregate levels. Data is populated into the DW through the processes of extraction, transformation, and loading.
Businesses employ data warehouses because a database designed to handle transactions isn't structured to do analytics well. A data warehouse, on the other hand, is structured to make analytics fast and easy, typically by storing data in a columnar format. Data warehouses are used for online analytical processing (OLAP), which uses complex queries to analyze data rather than process transactions.
Database¶
A database is a set of tables composed of records and fields that hold data. Different tables contain information about different types of things. Each row in a database table represents one instance of the type of object described in that table. A row is also called a record. The columns in a table are the set of facts that we keep track of about that type of object. A column is also called an attribute. The number of columns in a table depends on how many different types of information we need to store about each object, while the number of rows corresponds to the number of objects stored.
A database management system (DBMS) is a software package designed to define, manipulate, retrieve, and manage data in a database. A DBMS generally manipulates the data itself, the data format, field names, record structure, and file structure. It also defines rules to validate and manipulate this data. Several different types of DBMS have been developed, including hierarchical, network, and relational. Relational databases have enjoyed a long run as the database mainstay across a wide variety of businesses, and for good reasons.
SQL (Structured Query Language) is a programming language used to communicate with data stored in a relational database management system (RDBMS). SQL syntax is similar to the English language, which makes it relatively easy to write, read, and interpret. (SQL is pronounced in one of two ways: by speaking each letter individually, 'S-Q-L', or as the word 'sequel.')
Conventional SQL (i.e., relational) databases are the product of decades of technology evolution, good practice, and real-world stress testing. They are designed for reliable transactions and ad hoc queries, the staples of line-of-business applications. When the era of big data hit, a new kind of database was required. The driver for NoSQL was the sheer shift in data volumes that the Internet brought. NoSQL systems store and manage data in ways that allow for high operational speed and great flexibility on the part of the developers. Unlike SQL databases, many NoSQL databases can be scaled horizontally across hundreds or thousands of servers.
The most popular database management systems include Oracle, MySQL, Microsoft SQL Server, PostgreSQL, MongoDB, and IBM Db2.
Database migrations¶
Database migrations refer to the process of modifying the structure of a database while preserving its existing data. Migrations allow developers to manage changes in the database schema, such as adding, modifying, or deleting tables and columns, without losing the current data stored in the database. This process is often handled automatically by ORM (Object-Relational Mapping) tools, which generate scripts to apply these changes consistently and safely across different environments. Migrations are essential for evolving applications while maintaining data integrity and minimizing downtime.
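ORM-based tools automate this, but the core idea can be sketched with a hand-rolled schema-version table; the sqlite3 statements below are hypothetical schema changes, not any particular tool's migration format:

```python
import sqlite3

# Ordered, append-only list of hypothetical migrations; existing data is preserved
MIGRATIONS = [
    "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)",
    "ALTER TABLE users ADD COLUMN email TEXT",
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER REFERENCES users(id))",
]

def migrate(conn: sqlite3.Connection) -> None:
    conn.execute("CREATE TABLE IF NOT EXISTS schema_version (version INTEGER)")
    current = conn.execute("SELECT MAX(version) FROM schema_version").fetchone()[0] or 0
    # Apply only the migrations that have not been applied yet
    for version, statement in enumerate(MIGRATIONS[current:], start=current + 1):
        conn.execute(statement)
        conn.execute("INSERT INTO schema_version (version) VALUES (?)", (version,))
    conn.commit()

conn = sqlite3.connect(":memory:")
migrate(conn)  # applies all three migrations
migrate(conn)  # no-op: the schema is already up to date
```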
Document-oriented database¶
This type of database stores data as documents. These documents can be in formats such as JSON, BSON (binary JSON), XML, or other structured formats where data is represented as a hierarchical structure. It's important to note that document-oriented databases do not have a fixed schema, which allows for flexibility when storing data with different structures.
Key features:
- Schema flexibility: In document-oriented databases, there are no strict requirements for the structure of the data. One document may contain fields that are absent in another. This makes it easy to modify the structure of the data without needing to alter the database itself.
- Complex object storage: Documents can include nested objects and arrays, allowing the storage of complex data structures within a single document.
- Scalability: These databases are often used for horizontal scaling, which efficiently distributes data across multiple nodes.
- Example: MongoDB is one of the most well-known document-oriented databases.
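A minimal sketch with the pymongo driver; the connection URI, database, and collection names are assumptions, and a MongoDB server must be reachable for it to run:

```python
from pymongo import MongoClient

# Assumes a MongoDB instance running locally; adjust the URI for your environment
client = MongoClient("mongodb://localhost:27017")
db = client["shop"]

# Documents with different structures can live in the same collection (no fixed schema)
db.users.insert_one({"name": "Alice", "email": "alice@example.com"})
db.users.insert_one({
    "name": "Bob",
    "addresses": [{"city": "Berlin"}, {"city": "Paris"}],  # nested array of objects
})

# Query by a field that only some documents contain
print(db.users.find_one({"name": "Bob"}))
```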
ELT¶
ELT (extract, load, transform) provides a modern alternative to ETL. Instead of transforming the data before it’s written, ELT leverages the target system to do the transformation. The data is copied to the target and then transformed in place.
One of the main attractions of ELT is its reduction in load times relative to the ETL model. Taking advantage of the processing capability built into a data warehousing infrastructure reduces the time that data spends in transit and is more cost-effective. This capability is most useful for processing the large data sets required for business intelligence (BI) and big data analytics.
ETL¶
ETL is short for extract, transform, load — three database functions that are combined into one tool to pull data out of one database and place it into another database.
- Extract is the process of reading data from a database. In this stage, the data is collected, often from multiple and different types of sources.
- Transform is the process of converting the extracted data from its previous form into the form it needs to be in so that it can be placed into another database. Transformation occurs by using rules or lookup tables or by combining the data with other data.
- Load is the process of writing the data into the target database.
A properly designed ETL system extracts data from the source systems, enforces data quality and consistency standards, conforms data so that separate sources can be used together, and finally delivers data in a presentation-ready format so that application developers can build applications and end users can make decisions.
ETL can be contrasted with ELT (Extract, Load, Transform), which transfers raw data from a source server to a data warehouse on a target server and then prepares the information for downstream uses.
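A toy end-to-end ETL run in Python, with a hypothetical in-memory export as the source, a lookup-table transformation, and sqlite3 standing in for the target database:

```python
import sqlite3

# Extract: read raw records from the source system (here, an in-memory export)
source_rows = [
    {"country_code": "DE", "amount": "19.99"},
    {"country_code": "FR", "amount": "5.50"},
]

# Transform: apply a lookup table and convert types into the target form
COUNTRIES = {"DE": "Germany", "FR": "France"}
transformed = [(COUNTRIES[row["country_code"]], float(row["amount"])) for row in source_rows]

# Load: write the transformed rows into the target database
target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE sales (country TEXT, amount REAL)")
target.executemany("INSERT INTO sales VALUES (?, ?)", transformed)
target.commit()

print(target.execute("SELECT country, SUM(amount) FROM sales GROUP BY country").fetchall())
```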
ETL pipeline¶
A data pipeline is a general term for a process that moves data from a source to a destination. ETL (extract, transform, and load) uses a data pipeline to move the data it extracts from a source to the destination, where it loads the data.
Avoid building this yourself if possible, as wiring up an off-the-shelf solution will be much less costly with small data volumes.
Lazy loading¶
Lazy loading is a mechanism that loads related data only when it is specifically requested, rather than loading it all at once. This helps optimize database queries and reduces unnecessary operations by deferring the loading of additional data until it is needed. By using lazy loading, the system avoids fetching large amounts of related data upfront, which can improve performance and resource efficiency, especially when dealing with complex relationships or large datasets.
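ORMs typically implement this for related rows; the idea itself can be shown with a plain Python property that defers a hypothetical expensive query until the attribute is first accessed:

```python
from functools import cached_property

class User:
    def __init__(self, user_id: int) -> None:
        self.user_id = user_id  # cheap data, loaded immediately

    @cached_property
    def orders(self) -> list:
        # Expensive: runs only on first access, then is cached;
        # stands in for a query against a related database table
        print(f"loading orders for user {self.user_id} ...")
        return [{"id": 1, "total": 42.0}]

user = User(7)
print("user created, related data not loaded yet")
print(user.orders)  # triggers the deferred load
print(user.orders)  # served from cache, no second query
```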
Relational database¶
A relational database is based on the relational model, proposed by Edgar Codd. Data is stored in the form of tables (relations), where rows represent records, and columns represent fields with defined data types. Each table has a well-defined schema (a set of fields and their types), ensuring structured and rigid data storage.
Key features:
- Fixed schema: Data in a relational database is organized into tables with a clear structure, where each record must follow certain rules and constraints (such as data type for each column).
- Relationships between tables: Tables can be linked to each other using primary and foreign keys, allowing easy creation of relationships between different data objects (e.g., users and orders).
- SQL language: Relational databases use the Structured Query Language (SQL) for interacting with the database, enabling complex queries, filtering, aggregation, and data manipulation.
- Example: MySQL, PostgreSQL, and Oracle Database are examples of relational databases.
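A minimal sqlite3 sketch of a fixed schema, a primary/foreign key relationship between hypothetical users and orders tables, and an SQL query joining them:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE orders (
        id      INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL REFERENCES users(id),  -- foreign key to users
        total   REAL NOT NULL
    );
    INSERT INTO users  VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO orders VALUES (1, 1, 19.99), (2, 1, 5.50), (3, 2, 7.00);
""")

# Join across the relationship: total spent per user
for name, total in conn.execute("""
    SELECT u.name, SUM(o.total)
    FROM users u JOIN orders o ON o.user_id = u.id
    GROUP BY u.name
"""):
    print(name, total)
```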
SQL Injection¶
SQL Injection is a type of attack on web applications where an attacker injects malicious SQL code into an input field, aiming to trick the database into executing unwanted commands. A successful SQL injection can allow the attacker to access data they shouldn't have access to, modify or delete data, and sometimes even take full control of the database server.
SQL injection occurs when an application improperly handles user input and inserts it directly into an SQL query without proper validation or escaping. It is one of the most common and dangerous vulnerabilities in web applications.
Methods to prevent SQL injection:
- Using prepared statements: instead of inserting user input directly into the query, use parameters so the database treats the input as data rather than executable SQL (see the sketch after this list).
- Escaping input: validate and sanitize user input to avoid the injection of SQL code.
- Using ORM (Object-Relational Mapping): ORM frameworks abstract interactions with the database, reducing the likelihood of writing vulnerable SQL code.
- Limiting database permissions: configure user roles with the least privileges necessary to minimize potential damage.
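A minimal Python/sqlite3 sketch of the difference between concatenating user input into a query (vulnerable) and using a parameterized statement (safe); the table and the malicious input are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, is_admin INTEGER)")
conn.execute("INSERT INTO users VALUES ('alice', 0), ('root', 1)")

malicious_input = "nobody' OR '1'='1"

# Vulnerable: the input becomes part of the SQL text, and the injected
# OR clause makes the query return every row in the table
query = "SELECT * FROM users WHERE name = '" + malicious_input + "'"
print(conn.execute(query).fetchall())  # leaks all users

# Safe: the parameterized (prepared) statement treats the input purely as data
print(conn.execute("SELECT * FROM users WHERE name = ?", (malicious_input,)).fetchall())  # []
```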
Infrastructure¶
CI/CD Pipeline¶
CI/CD Pipeline is a process that automates the building, testing, and deployment of software. It consists of several stages: first, the code is built, then tests are executed, and after successful testing, the code is automatically deployed to the appropriate environment (e.g., testing or production). The goal of such a pipeline is to speed up the development cycle, improve code quality, and minimize manual effort during deployment.
CI¶
CI (Continuous Integration) is the practice of frequently integrating code changes made by different developers into a shared codebase. The goal of CI is to catch and fix bugs early in the development process through automated tests that run every time code is changed. This ensures that every new piece of code is checked for compatibility with the existing code.
CD¶
CD (Continuous Delivery or Continuous Deployment) is the practice of automatically deploying code changes to testing or production environments. Continuous Delivery means that after passing the tests, the code can be deployed manually, while Continuous Deployment automatically pushes the code to production without manual intervention.
Runners¶
Runners are components used in CI/CD systems to execute tasks (jobs) such as building, testing, and deploying software. When a developer submits new code, a runner is responsible for performing the automated tasks as part of the CI/CD pipeline.
For example, GitLab Runner is an agent that executes tasks described in the .gitlab-ci.yml file. Runners can be configured on different servers and can run on local machines, in the cloud, or in containers. They allow parallel execution of processes, speeding up the pipeline and optimizing resource usage.
Triggers¶
Triggers in the context of CI/CD are mechanisms that initiate the execution of a pipeline or a specific task in response to a particular event. These events can include code changes like a commit to a repository, the creation of a new tag, a pull request, or even an external request via an API.
Triggers are used to automatically start the build, testing, or deployment processes when certain conditions are met. For instance, when code is pushed to the main branch, a trigger can launch the pipeline for automatic deployment to the production environment.
YAML¶
YAML (a recursive acronym for "YAML Ain't Markup Language"; originally "Yet Another Markup Language") is a human-readable data serialization format used for configuration files and data exchange between applications. It relies on indentation to define data structure, making it easily readable and editable by humans. YAML supports various data types, including scalars (strings, numbers), lists, and dictionaries, allowing complex data structures to be described in a simple format.
In the context of CI/CD pipelines, YAML is often used to define process configurations. For example, in systems like GitLab CI/CD or GitHub Actions, files with the .yml or .yaml extension contain descriptions of the stages for building, testing, and deploying an application.
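A small sketch of YAML scalars, lists, and dictionaries parsed with the PyYAML package (assumed to be installed); the pipeline-like fields are illustrative, not any particular CI system's schema:

```python
import yaml  # provided by the PyYAML package

config_text = """
stages:            # a list of scalars
  - build
  - test
  - deploy
build-job:         # a dictionary (mapping)
  stage: build
  script:
    - echo "compiling the project"
  retries: 2       # parsed as an integer, not a string
"""

config = yaml.safe_load(config_text)
print(config["stages"])                    # ['build', 'test', 'deploy']
print(config["build-job"]["retries"] + 1)  # 3
```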
Jobs¶
Jobs in the context of CI/CD are individual steps or tasks executed as part of a pipeline. Each job represents a discrete stage, which might involve building code, running tests, performing security checks, or deploying the application. Jobs are typically run in a specific sequence and may depend on each other.
In a CI/CD pipeline, jobs are triggered automatically according to configurations defined in a configuration file (often in YAML format). For example, in GitLab CI, the .gitlab-ci.yml file defines jobs like building the project or running unit tests. Jobs can be executed in parallel or sequentially, depending on their dependencies.
Stages¶
Stages in CI/CD are logical groupings of jobs that are executed sequentially. Each stage contains a set of related jobs that must be completed before moving on to the next stage. Stages simplify the pipeline structure and help organize the process, ensuring that certain actions (e.g., building) occur before others (e.g., testing or deployment).
For example, a typical CI/CD pipeline might consist of the following stages:
- Build: compiling or building the code.
- Test: running automated tests.
- Deploy: deploying the application to a testing or production environment.
If an error occurs in the test stage, subsequent stages (like deployment) will not be triggered.
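That ordering rule can be sketched in plain Python; the stage names and the always-failing test job are placeholders, not a real CI configuration:

```python
# Hypothetical pipeline: each stage is a list of jobs, each job returns True on success
def build_app() -> bool:
    print("building...")
    return True

def run_tests() -> bool:
    print("testing...")
    return False  # simulate a failing test

def deploy() -> bool:
    print("deploying...")
    return True

PIPELINE = {
    "build":  [build_app],
    "test":   [run_tests],
    "deploy": [deploy],
}

for stage, jobs in PIPELINE.items():    # stages run sequentially, in order
    if not all(job() for job in jobs):  # every job in a stage must succeed
        print(f"stage '{stage}' failed; later stages are skipped")
        break
```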
Docker¶
Docker is a containerization platform that allows developers to build, deploy, and run applications in isolated environments called containers. Containers package everything needed to run the application: code, libraries, dependencies, and configuration files. Unlike virtual machines, containers share the host operating system’s kernel, making them more lightweight and efficient.
Docker simplifies managing dependencies and environment configurations, making applications more portable. This allows developers to build and test applications on one machine and then deploy them to another environment (e.g., a server or cloud) without modification. Docker is widely used in CI/CD pipelines to ensure consistent environments for building and testing software.
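A minimal sketch with the Docker SDK for Python (the docker package), assuming a local Docker daemon is running and using the public alpine image as a throwaway example:

```python
import docker

# Connect to the local Docker daemon using environment defaults
client = docker.from_env()

# Run a short-lived container; the image bundles everything the command needs
output = client.containers.run("alpine", ["echo", "hello from a container"], remove=True)
print(output.decode().strip())

# List containers currently running on the host
for container in client.containers.list():
    print(container.name, container.status)
```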
Kubernetes¶
Kubernetes is an open-source container orchestration platform that automates the deployment, management, and scaling of containerized applications. It manages clusters of machines running containers and distributes workloads across them, ensuring reliability, load balancing, and automatic recovery of applications in case of failures.
Key features of Kubernetes include:
- Container orchestration: automates the deployment and management of containers.
- Scaling: automatically scales applications based on demand.
- Service discovery and load balancing: manages internal and external networking between containers.
- Self-healing: restarts containers when they fail.
- Configuration and secret management: securely handles sensitive information and configuration.
Kubernetes has become the de facto standard for managing containerized applications in scalable infrastructures, such as cloud environments.
Helm¶
Helm is a package manager for Kubernetes that simplifies the installation and management of applications within Kubernetes clusters. It allows users to create, version, share, and deploy complex applications in Kubernetes using templates and predefined configurations. The core concept of Helm is the chart, which is a package of Kubernetes resources that describe an application and its dependencies.
Key features of Helm include:
- Simplified deployment: automates the deployment of complex applications in Kubernetes using standardized configurations.
- Versioning: supports chart versioning to simplify updates and rollbacks.
- Parameterization: enables customization of deployments through parameter values (values.yaml) without changing the chart itself.
- Reusability: facilitates easy sharing of charts among teams and communities.
Helm makes it easier to manage complex applications in Kubernetes, standardizing and streamlining the process of deployment, upgrades, and management.
Ingress¶
Ingress in Kubernetes is an object that manages external access to services within the cluster. It defines rules for routing traffic from outside the cluster to specific services, offering features such as URL routing, load balancing, SSL termination, and virtual hosts. Unlike other access methods like NodePort or LoadBalancer, Ingress provides more flexible routing management and acts as a centralized entry point for applications.
Key features of Ingress:
- URL routing: directs requests to different services based on the path or host in the URL.
- SSL termination: manages SSL certificates to secure traffic, terminating HTTPS at the Ingress level.
- Load balancing: distributes traffic among different instances of an application to optimize load.
- Virtual hosts: allows serving multiple domains on a single IP address with different routing rules.
To function, Ingress requires an Ingress Controller, which processes the rules and ensures that traffic is routed correctly according to the Ingress configuration.
Grafana¶
Grafana is a powerful data visualization and monitoring platform that allows users to collect, analyze, and display data from various sources in interactive dashboards. Grafana supports a wide range of databases and monitoring services like Prometheus, InfluxDB, Elasticsearch, Graphite, and more, making it a versatile tool for monitoring infrastructure, applications, and business metrics.
Key features of Grafana:
- Dashboards: create customizable panels with graphs, charts, and metrics that can be easily tailored to different needs.
- Multi-data source support: connect to a wide range of databases and monitoring systems.
- Alerting: set up automated notifications based on specific conditions or metrics.
- Roles and access control: manage user access to data and dashboards, with support for integration with various authentication systems.
- Plugins: extend functionality with plugins for new data sources, visualization types, and other features.
Grafana is widely used for monitoring servers, applications, containers, and other infrastructure elements, enabling users to build centralized dashboards for real-time monitoring and data analysis.
AWS CodeBuild¶
AWS CodeBuild is a fully managed continuous integration service that compiles source code, runs tests, and produces deployable software packages. CodeBuild automatically scales to handle multiple builds concurrently, eliminating the need for you to manage any servers. It is commonly used in CI/CD pipelines to automate the build and test stages of applications.
Key features of AWS CodeBuild:
- Supports multiple languages and environments: works with a variety of programming languages and environments (e.g., Java, Python, Node.js, Docker), and allows the use of custom build images.
- Scalability: automatically scales based on workload without the need to configure servers.
- Integration with other AWS services: tightly integrates with services like AWS CodePipeline, S3, CloudWatch, IAM, and CodeCommit to create a complete CI/CD pipeline.
- Pay-as-you-go pricing: you only pay for the compute time you use for your builds.
- Testing and reporting support: can run unit tests and provide reports on code coverage and test results.
AWS CodeBuild is often used as part of a toolchain for automating the build, test, and deployment processes, particularly for projects deployed on AWS.
AWS ECR (Elastic Container Registry)¶
AWS ECR (Elastic Container Registry) is a fully managed container registry provided by AWS that allows developers to securely store, manage, and deploy Docker images and other container artifacts. ECR integrates with AWS services like ECS (Elastic Container Service), EKS (Elastic Kubernetes Service), and AWS Lambda, as well as with CI/CD tools such as AWS CodeBuild and CodePipeline to automate containerization workflows and deployments.
Key features of AWS ECR:
- Managed container storage: securely store and manage Docker images ready for deployment.
- AWS integration: seamlessly integrates with ECS, EKS, and other AWS services to simplify container deployment.
- Security: supports data encryption at rest and in transit, along with AWS IAM integration for access control.
- Scalability: automatically scales to accommodate a large number of images and high deployment traffic.
- Version control: supports versioning of container images, making it easy to track and manage different versions of images.
ECR is used for storing and managing container images, which can then be deployed in the cloud via services like ECS and EKS, making it a critical part of CI/CD pipelines in cloud-based and containerized infrastructures.
AWS Secrets Manager¶
AWS Secrets Manager is a managed service provided by AWS that helps securely store, manage, and access sensitive information such as passwords, API keys, database credentials, and other secrets. Secrets Manager automatically rotates and updates secrets on a defined schedule, simplifying the security management of sensitive data.
Key features of AWS Secrets Manager:
- Secure secret storage: provides encryption of secrets using AWS KMS (Key Management Service) to protect data at all levels.
- Automatic rotation: enables automatic updating of secrets, such as passwords, through integration with databases and other services.
- Access control: integrates with AWS IAM to allow detailed control over who can access secrets, limiting it to authorized users or services.
- Integration with other AWS services: seamlessly integrates with services like Amazon RDS, Redshift, Lambda, and others to automatically retrieve secrets during runtime.
- Auditing and monitoring: supports logging of access to secrets via AWS CloudTrail, providing traceability and insights into actions related to secret management.
Secrets Manager enhances security and simplifies the management of sensitive information in cloud applications, particularly where frequent updates to keys and passwords are required.
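A minimal retrieval sketch with boto3, assuming AWS credentials and a region are configured and that a secret with the hypothetical name below exists:

```python
import json
import boto3

# Assumes credentials/region are configured (environment, ~/.aws/config, or an IAM role)
client = boto3.client("secretsmanager")

# "prod/db-credentials" is a hypothetical secret name
response = client.get_secret_value(SecretId="prod/db-credentials")

# Secrets are commonly stored as JSON strings
secret = json.loads(response["SecretString"])
print(secret["username"])  # never log the password itself
```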
AWS IAM¶
AWS IAM (Identity and Access Management) is an AWS service that allows you to manage access to AWS resources and services at the user and role level. IAM provides centralized security management, allowing you to create and control users and groups and to manage access permissions for different AWS services.
Key features of AWS IAM:
- User and group management: create and manage user accounts and groups for secure access to AWS resources.
- IAM Roles: allows assigning roles to AWS services (e.g., EC2 or Lambda) so that they can interact with other services on behalf of the user.
- Access policies: IAM uses policies to configure access based on the principle of least privilege, specifying who has access to which resources.
- Multi-factor authentication (MFA): supports an additional layer of security through multi-factor authentication.
- Temporary security credentials: provides temporary credentials for short-term tasks, such as accessing resources via AssumeRole.
IAM is critical for securing AWS environments by giving precise control over who can access resources and what actions they are allowed to perform.
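Access policies are JSON documents; here is a sketch of a least-privilege, read-only policy for one hypothetical S3 bucket, created via boto3 (credentials, the bucket name, and the policy name are assumptions):

```python
import json
import boto3

# Least privilege: read-only access to a single hypothetical bucket
policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::example-reports-bucket",
                "arn:aws:s3:::example-reports-bucket/*",
            ],
        }
    ],
}

iam = boto3.client("iam")  # assumes configured AWS credentials
iam.create_policy(
    PolicyName="ReportsReadOnly",
    PolicyDocument=json.dumps(policy_document),
)
```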
SSL¶
SSL (Secure Sockets Layer) is an outdated cryptographic protocol that provided secure communication between a client and a server over the internet by encrypting data transmitted between them. SSL was used to protect the transmission of sensitive information, such as passwords, credit card details, and other confidential data, from being intercepted by third parties.
Although SSL has been replaced by its more secure successor, TLS (Transport Layer Security), the term SSL is still often used to refer to secure connections. Websites using SSL/TLS implement HTTPS (Hypertext Transfer Protocol Secure) to indicate that the connection is encrypted.
Key features of SSL:
- Encryption: ensures that data is encrypted, protecting it from being intercepted by malicious actors.
- Authentication: verifies that the website or server is who it claims to be through certificates.
- Data integrity: ensures that data is not altered or tampered with during transmission.
SSL certificates are issued by Certificate Authorities (CAs) and are required to establish secure HTTPS connections between websites and their users.
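A minimal Python sketch of establishing a verified TLS connection with the standard library; example.com stands in for any HTTPS host:

```python
import socket
import ssl

hostname = "example.com"
context = ssl.create_default_context()  # loads trusted CA certificates, enables verification

with socket.create_connection((hostname, 443)) as sock:
    # The handshake authenticates the server's certificate and negotiates encryption
    with context.wrap_socket(sock, server_hostname=hostname) as tls_sock:
        print("protocol:", tls_sock.version())            # e.g. 'TLSv1.3'
        print("issuer:", tls_sock.getpeercert()["issuer"])
```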