Tools

Development and programming tools are used to build frameworks and to create, debug, and maintain programs, among many other tasks. The resources in this Zone cover topics such as compilers, database management systems, code editors, and other software tools, and can help ensure engineers are writing clean code.

Latest Refcards and Trend Reports
Trend Report: Kubernetes in the Enterprise
Refcard #366: Advanced Jenkins
Refcard #378: Apache Kafka Patterns and Anti-Patterns

DZone's Featured Tools Resources

The Importance of Code Profiling in Performance Engineering
By RadhaKrishna Prasad
When we discuss code profiling with a team of developers, they often say, "We don't have time to profile our code; that's why we have performance testers," or, "If the application runs very slowly, the developers and performance testers can just ask the infrastructure team to add another server to the farm." Developers usually regard code profiling as additional, challenging work, and most projects only turn to performance and memory profiling once something is seriously wrong with performance in production. Because many of us lack knowledge and experience of how to profile and how the various profilers and profiling types work, we fail to identify and address performance problems. Since 70 to 80 percent of performance problems stem from inefficient code, it is recommended to use code profiling tools to measure and analyze performance degradations in the early stages of development. Finding and fixing performance issues early can make a big difference overall, especially if every developer tests and profiles code as soon as it is written. This article is primarily intended for developers, leads, architects, business analysts, and, most particularly, performance engineers.

What Is Code Profiling?

In most codebases, no matter how large, there are a few places that are always slow. We start by measuring the total time of the functionality we find slow, then use the available profilers to measure everything in detail and find out which function calls are slow. Then comes the hard part: figuring out where the time is spent, why it is spent there, and what can be done about it. This is where code profiling comes in: a process used in software engineering to measure and analyze the performance of a program. It gives a complete breakdown of the execution time of each method in the source code, including memory allocation and function calls, and it helps developers and performance engineers identify which specific areas of the code cause bottlenecks or slow down the overall application or system. Developers and performance engineers can use various free and commercial profiling tools to understand which areas of the code take the longest to run, analyze resource utilization, detect memory-related problems, and prioritize their optimization efforts on those problematic regions. The profiling process helps identify and eliminate performance issues and optimize code execution, ultimately improving the overall performance of the application or system.

Why Code Profiling?

Code profiling helps discover which parts of your application consume an unusual amount of time or system resources; for example, a single function, or two functions called together, taking up 70% of the CPU or execution time. When we encounter a performance problem in production or in a load test, we conduct thorough code profiling to find out which lines of code consume the most CPU cycles and other resources. According to the Pareto principle, also known as the 80/20 rule, 80 percent of any speed problem lies in 20 percent of the code. A minimal example of this measurement step, using Python's standard-library profiler, appears below.
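To make the measurement step concrete, here is a minimal sketch using Python's standard-library cProfile and pstats modules (cProfile is a deterministic, tracing profiler). The workload and function names are placeholders for illustration, not code from the article.

```python
# Minimal sketch: record per-function timings with cProfile and sort the
# results so the slowest call paths surface first.
import cProfile
import io
import pstats


def parse_records(rows):
    # Placeholder for a hot inner function.
    return [row.strip().split(",") for row in rows]


def generate_report(n=50_000):
    rows = [f"id-{i},value-{i}\n" for i in range(n)]
    parsed = parse_records(rows)
    return sum(len(fields) for fields in parsed)


profiler = cProfile.Profile()
profiler.enable()
generate_report()
profiler.disable()

stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(10)  # top 10 call paths by cumulative time
print(stream.getvalue())
```

Sorting by cumulative time surfaces the slowest execution paths first, which matches the workflow described above: measure first, then decide where to dig in.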
In performance engineering, code profiling gives good insight into components and resources and helps identify and analyze performance degradations across many places in a large-scale distributed environment. It goes beyond the basic statistics collected by system performance monitoring tools, down to the functions and allocated objects within the executing application. When profiling a Java or .NET application, the execution speeds of all the functions and the resources they utilize are logged for a specific set of transactions, depending on the profiling type chosen. The data collected provides more information about where performance bottlenecks and memory-related problems could lie.

Code Profiling Types

Whether you work in Java, Python, or .NET, there are several ways to profile code, and to work with any profiling tool you must have a solid understanding of the profiling types. Each technique has its pros and cons. The main types are discussed below.

Sampling

The sampling profiling type has minimal overhead: it takes frequent periodic snapshots of the threads running in your application to check which methods are executing and which objects are stored on the heap. It averages the collected information to give a picture of what your application is doing, at a relatively low resolution. It is not very invasive and has only a slight impact on performance. If you are a beginner and unsure which type to choose, always start with sampling.

Instrumentation

The instrumentation profiling type injects code at the beginning and end of methods, which yields very accurate timings of how long methods take to execute and how frequently they are invoked, but it comes with a greater performance overhead. If used carelessly, it can have a large impact on your application's performance, so it is recommended to decide clearly which parts of the application you want to profile and instrument only those.

Performance Profiling

Performance profiling is all about finding out which areas of your program use an excessive amount of time or system resources. For example, if a single function, method, or call consumes 80% of the execution or CPU time, it usually requires investigation. Performance profiling also reveals where an application typically spends its time and how it competes for server and local resources, and it highlights the potential bottlenecks that need optimization. Developers and performance engineers do not have to spend lots of time on micro-profiling the code; with the appropriate profiling tools and training, they can identify potential problems and performance degradations and fix them before the fully tested code is pushed to production. To measure an application's performance, you first identify how long a particular transaction takes to execute. You must then be able to break down the results in several ways, particularly by function calls and function call trees (the chain of calls created when one function calls another, and so on).
This breakdown identifies the slowest function as well as the slowest execution path, which is useful because either a single function or several functions together can be slow. The main objective of performance profiling is to answer the question, "Which area or line of the code is slow?" Is it the client side, server side, network, OS, web server, application server, database server, or some other component? A multilayered distributed application can be very hard to profile simply because of the large number of parameters involved. If you are unsure whether the issue lies with the application or the database, APM tools can help identify the responsible layer (web, app, or DB). In more complex scenarios, a network monitoring tool may also be required to analyze packet journey times, server processing time, network time, and network issues such as congestion, bandwidth, or latency. Once you have identified the problematic layer (or, if you prefer, the slow bit), you will have a better idea of which kind of profiler and profiling type to use. Naturally, if it is a database problem, use one of the profiling tools offered by the database vendor to pinpoint it.

Memory Profiling

The biggest benefit of memory profiling an application while it is still under development is that it lets developers identify excessive memory consumption, bottlenecks, or primary processing hotspots in the code immediately. If the entire team of developers adopts this approach, the performance gains can be tremendous. Java profilers are agents: they add instrumentation code to the beginning and end of methods to track how long the methods take, and they add code to the constructor and finalizer of every class to keep track of how much memory is used. The way developers write code directly affects application performance through how the objects we create are allocated and destroyed. In many cases the application uses more memory than necessary, which makes the memory manager work harder and eventually leads to problems such as memory leaks, out-of-memory errors, performance degradation, excessive memory consumption, application crashes, restarts, slowness, or GC times greater than 20 to 30 percent. Many profiling tools for Java and .NET, such as JProfiler, JVisualVM, JConsole, YourKit Profiler, Redgate ANTS Profiler, and dotTrace, allow developers to take memory snapshots at different intervals and compare them against each other to find classes and objects that require immediate investigation. Memory profilers help identify the largest allocated objects and the methods and call trees responsible for allocating large amounts of memory. Using these tools, we profile the application to collect GC statistics, object lifetimes, and object allocation information. This helps identify expensive allocated objects and functions, memory leaks, and heap memory issues in the young (Eden, S0, and S1) and old generations for Java, as well as the SOH and LOH for .NET. It also helps surface the functions that allocate the most memory, the types with the most memory allocated, the types with the most instances, the most memory-expensive function call trees, and so on. A minimal, language-neutral sketch of the snapshot-and-compare idea follows.
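The profilers named above are Java and .NET tools; as a language-neutral illustration of the same snapshot-and-compare workflow, here is a minimal sketch using Python's standard-library tracemalloc module. The workload is a placeholder.

```python
# Minimal sketch: take memory snapshots at two points and diff them to see
# which source lines allocated the most memory in between.
import tracemalloc

tracemalloc.start()
baseline = tracemalloc.take_snapshot()

# Placeholder allocation-heavy work; keep a reference so the memory stays allocated.
cache = [bytes(1024) for _ in range(10_000)]

after = tracemalloc.take_snapshot()
for stat in after.compare_to(baseline, "lineno")[:5]:
    print(stat)  # top allocation growth, grouped by source line
```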
Some profilers can track memory allocation by function call, which lets us see which functions are responsible for leaking memory; this is also an effective technique for locating a memory leak.

CPU Profiling

This profiling type measures how much CPU time is spent in each function or line of code, helping to identify bottlenecks and areas for optimization. Any function with high CPU utilization is an excellent candidate for optimization because excessive resource consumption can be a major bottleneck. Profiling tools help identify the most CPU-intensive lines of code within a function and determine whether suitable optimizations can be applied.

Thread Profiling

Thread profiling tracks the behavior and usage of threads in a program, helping to identify potential concurrency issues or thread contention. To address problems created by multiple threads accessing shared resources, developers use synchronization techniques to control access to those resources. This is a sound approach in general, but if implemented incorrectly it can leave threads fighting for the same resource, resulting in locks. To identify such problems, thread contention profiling analyzes thread synchronization within the running application: it hooks into Java and native synchronization methods and records when and for how long blocking happens, along with the call stack, which comes with greater overhead.

Network Profiling

Network profiling helps identify the number of bytes generated by a method or call tree, as well as functions that generate a high level of network activity and must be investigated and fixed. Developers and performance engineers should keep the number of network round trips as low as possible to reduce the effect of latency in load tests.

How To Choose The Right Code Profiling Tool

Choosing the right code profiling tool generally depends on several parameters, including the programming language, the tech stack, the scope of your project, the specific performance issues you are interested in solving, and your overall budget. The first step is simple: start with free tools and then evaluate commercial ones. Most vendors offer full evaluation copies with limited-duration licenses (typically a 14-day trial that can be extended by emailing the support team if you need more time). Make use of this and confirm that the tool works with your application and all the features you need. How can software developers and performance engineers guarantee that their application code is fast and efficient? Regardless of how skilled your development team is, very few lines of code are optimal when first written. Code must be analyzed, debugged, and reviewed to discover the most effective way to speed it up. The approach is to use a profiling tool to study the application's source code and detect and address performance bottlenecks at the earliest stages of development, before they surface later. Many profiling tools for Java, .NET, and Python can quickly show how an application executes, letting programmers focus on the problems that cause poor performance. The end result of selecting and using the right profiling tools is an optimized codebase that meets client requirements and business demands.
Blind Optimizations Will Only Waste Time

Code optimization without the right profiling tools can become problematic, because a developer will often misdiagnose potential bottlenecks based on false assumptions. We will probably see a list of five to ten methods that are much larger than the rest, and inspecting the code line by line is not feasible without a profiling tool. Blind optimizations are costly because of their possible negative effect on overall effectiveness and application performance: they involve developers altering code without fully understanding the application's functionality, how it works, or the side effects of the change. As a result, they may cause unexpected problems or degrade performance in other areas of the code. Moreover, blind optimizations may not address the underlying cause of performance degradation and may only provide temporary fixes. They often result in inefficient use of computational resources, longer execution times, and higher resource consumption, which means reduced performance, additional costs, and sometimes entirely new performance problems.

Always Measure The Application Performance Before You Optimize

To uncover performance problems, we first need a performance testing tool to conduct a load test and identify the transactions with high response times. Before running the analysis, we need a test plan that describes the sequence of user actions, API calls, and web service calls to be made, and the data to be passed during the load test. Many of our optimizations are based on assumptions about which parts of the code are likely slow. Remove the spaghetti code, make the changes, rerun the load tests with the same settings, correlate the test runs, and you will typically find solutions to the performance problems. For example, if you think your database connection is slow, log your database calls and read through the transaction logs. If you think your algorithm is slow, use a profiling tool to find out exactly which part of the code is slow. Measure the application's performance and profile frequently; this is particularly important when optimizing complex code. Developers have to be very analytical while optimizing code for better performance, or it becomes a time-consuming exercise that can introduce many new performance issues.

When To Start Code Profiling

In my experience, the usual trigger for profiling is that a performance problem has been found, generally during a load test or in a live system, and the developers have to react and fix the problem by profiling the source code through IDE integration. However, by starting code profiling early in the development process you can proactively identify and fix performance bottlenecks, which ultimately results in a simpler and better-performing codebase. It is therefore important for developers to start profiling early, preferably during the initial testing phases, if they want to find performance problems early and prevent them from becoming embedded in the codebase.
Developers and performance engineers have to conduct load tests to reproduce the problem, understand how the profiler works, learn how to use it, collect and interpret the results, revisit the source code, and confirm and fix the problem to improve performance. As soon as something is available to test, run load tests in parallel with development to make sure the performance issues found are fixed early. Developers and performance engineers should incorporate code profiling into the performance engineering process during the initial development and testing phases, so the application's performance is continuously monitored and improved as the code undergoes many changes.

Setup and Training on Code Profiling Tools

As a performance engineer, I do code profiling and optimization from time to time, and I mostly work on performance testing and engineering for Java and .NET applications. Before I start profiling, I keep an eye on the performance dashboard. When performance takes a hit, I analyze whether it is an anomaly or a genuine issue that must be addressed; for example, we watch for requests taking more than one second or background jobs taking longer than expected, and then profile those particular transactions. Due to a lack of training on how to set up and use the various code profiling tools, many developers and performance engineers are still unclear about when and how to use them. Code profiling should ideally be performed as each unit or method is developed. There are various open-source and commercial third-party profiling tools on the market, and they need to be evaluated before purchase to determine which best suits a particular technology or platform. Some industry-standard profiling tools are:

Java: JProfiler, JMC/JFR, JConsole, JVisualVM, YourKit Profiler, JProbe, etc.
.NET: JetBrains dotTrace, Redgate ANTS Performance and Memory Profilers, CLR Profiler, MEM Profiler, DevPartner, Visual Studio Profiling Tools, etc.
Python: timeit, cProfile, PyInstrument, etc.

Why Should We Worry About Profiler Overhead?

Any profiler we choose adds overhead to both the application being measured and the machine it runs on. The amount of overhead varies with the type of profiler and the profiling method used. In the case of a performance profiler, the act of measuring may influence the performance being measured. This is especially true for instrumenting profilers, which modify the application binary to insert timing probes into each function; there is more code to run, requiring more CPU and memory and therefore greater overhead. If your application is already memory- and CPU-intensive, things will likely get worse, and it may be impossible to analyze the entire application. Developers and performance engineers have to carefully decide what to profile and which profiling type will conserve resources while still providing accurate information in time to deal with the performance problem. For example, the overhead of instrumentation is very high compared to sampling.

The Flip Side: What Experts Are Saying About Code Profiling

Many architects and development leads say that focusing on micro-optimizations through code profiling can lead teams to ignore higher-level architectural and design improvements that would have a greater influence on overall performance.
Others point out that code profiling is a time-consuming and resource-intensive process: the profiling itself frequently creates overhead that can distort the findings and lead to incorrect conclusions about the application's performance. Critics further argue that excessive dependence on profiling tools can lead developers to prioritize isolated performance improvements over other crucial aspects of software development, such as maintainability, readability, and extensibility, trading code quality and sound system architecture for narrow gains.

Conclusion

Application performance problems are inevitable. They can come from anywhere, and sometimes you just need to know where to look. Both developers and performance engineers should learn how to profile an application and identify potential problems, which allows us to write better code that delivers the desired performance. Frequently testing the functionality we develop with a profiler and looking for common bottlenecks lets us find and fix many small issues that could otherwise become serious problems later in production. Running load tests as early as possible during development, and repeating them regularly with the latest builds, lets us identify problems as soon as they occur and highlights when a change has introduced one. Code profiling is not limited to developers; improving the efficiency of the code is everyone's job.
Comparing Pandas, Polars, and PySpark: A Benchmark Analysis
By Nacho Corcuera
Lately, I have been working with Polars and PySpark, which brings me back to the days when Spark fever was at its peak and every data processing solution seemed to revolve around it. This prompts me to question: was it really necessary? Let's delve into my experiences with various data processing technologies.

Background

During my final degree project on sentiment analysis, Pandas was just beginning to emerge as the primary tool for feature engineering. It was user-friendly and integrated seamlessly with several machine learning libraries, such as scikit-learn. Then, as I started working, Spark became part of my daily routine. I used it for ETL processes in a nascent data lake to implement business logic, although I wondered if we were over-engineering the process. Typically, the data volumes we handled were not substantial enough to necessitate Spark, yet it was employed every time new data entered the system: we would set up a cluster and proceed with processing in Spark. In only a few instances did I genuinely feel that Spark was the right tool for the job. This experience pushed me to develop a lightweight ingestion framework using Pandas. However, that framework did not perform as expected, struggling with medium to large files. Recently, I have started using Polars for some tasks, and I have been impressed by its performance on datasets with several million rows. This led me to set up a benchmark across all of these tools. Let's dive into it!

A Little Bit of Context

Pandas

We should not forget that Pandas has been the dominant tool for data manipulation, exploration, and analysis. Pandas rose in popularity among data scientists thanks to its similarity to R's data frames, and it plugs into the other Python libraries of the machine learning ecosystem:

NumPy is a mathematical library for linear algebra and standard numerical calculations; Pandas is built on top of NumPy.
Scikit-learn is the reference library for machine learning applications. Normally, all the data used for a model has been loaded, visualized, and analyzed with Pandas or NumPy.

PySpark

Spark is a free, distributed platform that changed the paradigm of big data processing, with PySpark as its Python library. It offers a unified computing engine with notable features:

In-memory processing: Spark's major feature is its in-memory architecture, which is fast because it keeps data in memory rather than on disk.
Fault tolerance: Built-in failure tolerance mechanisms ensure dependable data processing; Resilient Distributed Datasets track data lineage and allow automatic recovery in case of failures.
Scalability: Spark's horizontally scalable architecture processes large datasets adaptively, distributing work across the cluster and using the combined power of all its nodes.

Polars

Polars is a Python library built on top of Rust, combining the flexibility and user-friendliness of Python with the speed and scalability of Rust. Rust is a low-level language that prioritizes performance, reliability, and productivity; it is memory efficient and delivers performance on par with C and C++. Polars uses Apache Arrow as its in-memory data format and executes vectorized queries; Apache Arrow is a cross-language development platform for fast in-memory processing. Polars makes tabular data manipulation, analysis, and transformation fast, which favors its use with large datasets. Moreover, its expression syntax feels close to SQL, so complex data processing is easy to express. Another capability is its laziness, which lets it evaluate whole queries and apply query optimization; a minimal sketch of a lazy query follows.
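To make the laziness point concrete, here is a minimal hedged sketch of a lazy Polars query. The file name and column names are illustrative assumptions, not taken from the benchmark project.

```python
# Minimal sketch of Polars lazy evaluation: nothing is read or computed until
# collect() is called, so the optimizer can push filters down and read only
# the columns the query needs.
import polars as pl

lazy_query = (
    pl.scan_csv("transactions.csv")          # placeholder file; no I/O happens here
    .filter(pl.col("amount") > 100)
    .group_by("category")
    .agg(pl.col("amount").sum().alias("total_amount"))
)

result = lazy_query.collect()  # the optimized plan executes here
print(result)
```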
Benchmarking Setup

Here is a link to the GitHub project with all the information. There are four notebooks, one per tool (two for Polars, to test eager and lazy evaluation). The code measures execution time for the following tasks:

Reading
Filtering
Aggregations
Joining
Writing

There are five datasets of increasing size: 50,000, 250,000, 1,000,000, 5,000,000, and 25,000,000 rows. The idea is to test different scenarios and sizes. The data used for the test is a financial dataset from Kaggle. The tests were executed on macOS Sonoma with an Apple M1 Pro and 32 GB of RAM. A simplified sketch of this kind of timing harness appears at the end of the article.

Table of Execution Times

Row Size          Pandas    Polars Eager  Polars Lazy  PySpark
50,000 rows       0.368     0.132         0.078        1.216
250,000 rows      1.249     0.096         0.156        0.917
1,000,000 rows    4.899     0.302         0.300        1.850
5,000,000 rows    24.320    1.605         1.484        7.372
25,000,000 rows   187.383   13.001        11.662       44.724

Analysis

Pandas performed poorly, especially as dataset sizes increased, although it handled the small datasets with decent times. PySpark, even when executed on a single machine, shows considerable improvement over Pandas as the dataset size grows. Polars, in both eager and lazy configurations, significantly outperforms the other tools, with improvements of up to 95-97% compared to Pandas and 70-75% compared to PySpark, confirming its efficiency in handling large datasets on a single machine.

Visual Representations

(The original article includes charts of these results; the visual aids underline the relative efficiency of the different tools across the various test conditions.)

Conclusion

The benchmarking results offer clear insight into the performance and scalability of four widely used data processing tools across varying dataset sizes. From the analysis, several conclusions emerge:

Pandas performance scalability: Popular for data manipulation on smaller datasets, Pandas struggles significantly as the data volume increases, indicating it is not the best choice for high-volume data. However, its integration with many machine learning and statistical libraries keeps it indispensable for data science teams.
Efficiency of Polars: Both Polars configurations (eager and lazy) demonstrate exceptional performance across all tested scales, outperforming Pandas and PySpark by a wide margin and making Polars an efficient tool for processing large datasets. However, Polars has not yet released a major version of its Python package, and until it does, I don't recommend it for production systems.
Tool selection strategy: The findings underscore the importance of selecting the right tool based on the specific needs of the project and the available resources. For small to medium-sized datasets, Polars offers a significant performance advantage; for large-scale distributed processing, PySpark remains a robust option.
Future considerations: As dataset sizes continue to grow and processing demands increase, the choice of data processing tools becomes more critical. Tools like Polars, built on Rust, are emerging, and their results have to be taken into account. The tendency to use Spark as the solution for processing everything is also fading, and these tools are taking its place when there is no need for a large-scale distributed system.
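For readers who want a feel for how such timings can be collected, here is a rough sketch of a timing harness comparing Pandas and a lazy Polars pipeline on the same steps. The file and column names are placeholders; the actual benchmark code lives in the linked GitHub project.

```python
# Rough sketch of a timing harness: run the same filter/aggregate pipeline in
# Pandas and Polars and report wall-clock time for each.
import time

import pandas as pd
import polars as pl


def timed(label, fn):
    start = time.perf_counter()
    fn()
    print(f"{label}: {time.perf_counter() - start:.3f}s")


def pandas_pipeline():
    df = pd.read_csv("data.csv")                   # placeholder dataset
    df = df[df["amount"] > 100]
    df.groupby("category")["amount"].sum()


def polars_pipeline():
    (
        pl.scan_csv("data.csv")                    # lazy scan of the same file
        .filter(pl.col("amount") > 100)
        .group_by("category")
        .agg(pl.col("amount").sum())
        .collect()
    )


timed("pandas", pandas_pipeline)
timed("polars (lazy)", polars_pipeline)
```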
Use the right tool for the right job!
Advanced Linux Troubleshooting Techniques for Site Reliability Engineers
By Prashanth Ravula
Practical Use Cases With Terraform in Network Automation
By Karthik Rajashekaran
How To Manage Terraform Versions
By Alexander Sharov
Master the Art of Querying Data on Amazon S3

In an era where data is the new oil, using data effectively is crucial to the growth of every organization. This is especially true when it comes to taking advantage of the vast amounts of data stored in cloud platforms like Amazon S3 (Simple Storage Service), which has become a central repository for everything from web application content to big data analytics. It is not enough to store this data durably; you also have to query and analyze it effectively. That is what lets you gain valuable insights, find trends, and make data-driven decisions that move your organization forward. Without a querying capability, the data stored in S3 would be of little benefit. To avoid that scenario, Amazon Web Services (AWS) provides tools that make data queries accessible and powerful: Glue Crawler is best suited to classifying and cataloging data, Athena is a service for quick ad hoc queries, and Redshift Spectrum is a solid analytical engine capable of processing complex queries at scale. Each tool has its niche and offers a flexible approach to querying data according to your needs and the complexity of the task.

Exploring Glue Crawler for Data Cataloging

With the vast quantities of data stored on Amazon S3, finding an efficient way to sort and make sense of it is important. This leads us to Glue Crawler. It is like an automated librarian who organizes, classifies, and updates library books without human intervention; Glue Crawler does the same with Amazon S3 data. It automatically scans your storage, recognizes different data formats, and suggests schemas in the AWS Glue Data Catalog, simplifying what would otherwise be a hard manual task. Glue Crawler generates metadata tables by crawling structured and semi-structured data to organize it for query and analysis. The importance of a current data catalog cannot be overstated: a well-maintained catalog serves as a road map for stored data, and an up-to-date catalog ensures that when you use tools such as Amazon Athena or Redshift Spectrum, you work against the most current data structure, streamlining the query process. In addition, a centralized metadata repository improves collaboration between teams by providing a common understanding of the data layout. To make the most of Glue Crawler, here are some best practices:

Classify your data: Use classifiers to teach Glue Crawler about the different data types. Whether JSON, CSV, or Parquet, accurate classification ensures the schema created is as precise as possible.
Schedule regular crawls: Data changes over time, so schedule crawls to keep the catalog updated. This can be daily, weekly, or even after a particular event, depending on how frequently your data changes.
Use exclusions: Not all data must be crawled. Set exclusion patterns for temporary or redundant files to save time and reduce costs.
Review access policies: Check that the correct permissions are in place. Crawlers need access to the data they are expected to crawl, and users need the right permissions to access the updated catalog.

By following these tips, you can ensure that Glue Crawler works harmoniously with your data and improves the data environment. Adopting these best practices improves the data discovery process and lays a solid foundation for the next step in the data query process; a minimal sketch of setting up such a crawler programmatically follows.
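As a rough illustration of the ideas above, the following sketch registers and runs a Glue Crawler with boto3. The crawler name, IAM role, Glue database, and S3 path are placeholder assumptions you would replace with your own resources.

```python
# Hedged sketch: create a Glue Crawler over an S3 prefix, schedule it to run
# nightly, and kick off an initial crawl.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

glue.create_crawler(
    Name="sales-data-crawler",                                      # placeholder name
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",          # placeholder IAM role
    DatabaseName="analytics_catalog",                               # placeholder Glue database
    Targets={"S3Targets": [{"Path": "s3://example-bucket/sales/"}]},
    Schedule="cron(0 2 * * ? *)",   # nightly crawl, per the scheduling advice above
    Description="Catalogs the sales prefix of the data lake",
)

glue.start_crawler(Name="sales-data-crawler")
```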
Harnessing the Power of Amazon Athena for Query Execution

Imagine a scenario in which you are sorting through an enormous amount of data looking for a decisive insight hidden deep inside, and imagine doing it in just a few clicks and commands, without complex server configurations. Amazon Athena, an interactive query service, is tailor-made for this: it analyzes data directly on Amazon S3 using standard SQL. Athena is like having a powerful search engine for your data lake. It is serverless, meaning you do not have to manage the underlying infrastructure; you don't set up or maintain servers, and you only pay for the queries you run. Athena scales automatically, executes queries in parallel, and produces quick results even with large amounts of data and complex queries. The advantages of Amazon Athena are numerous, especially for ad hoc queries. First, it provides simplicity: with Athena you can start querying data in standard SQL without learning new languages or managing infrastructure. Second, there is cost: you pay per query, that is, only for the data scanned by your query, making it a cost-effective option for many use cases. Finally, Athena is very flexible: you can query data in formats such as CSV, JSON, ORC, Avro, and Parquet directly from S3 buckets. To maximize Athena's benefits, consider these best practices:

Compress your data: Compression can significantly reduce the data scanned by each query, resulting in faster performance and lower costs.
Use columnar formats: Store data in columnar formats such as Parquet or ORC. These formats are optimized for high-performance reads and reduce costs by scanning only the columns your query requires.
Partition your data: By partitioning data on commonly filtered columns, Athena can skip unnecessary partitions, improving performance and reducing the amount of data scanned.
Avoid SELECT *: Be specific about the required columns. Using "SELECT *" can scan more data than necessary.

Following these best practices improves query performance and helps manage costs. As mentioned in the previous section, having well-organized and classified data is essential: Athena benefits directly from that organization, and if the underlying data is properly structured and organized, it can be processed more efficiently. A minimal programmatic sketch of an Athena query that follows these practices appears below.
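To tie these practices together, here is a hedged boto3 sketch of an ad hoc Athena query that selects specific columns and filters on a partition key. The database, table, column, and bucket names are placeholder assumptions.

```python
# Hedged sketch: submit an Athena query, wait for it to finish, and count the
# returned rows. Note the specific column list and the partition filter (dt),
# following the best practices above.
import time

import boto3

athena = boto3.client("athena", region_name="us-east-1")

execution = athena.start_query_execution(
    QueryString=(
        "SELECT order_id, amount "
        "FROM sales "
        "WHERE dt = '2024-05-01' AND amount > 100"
    ),
    QueryExecutionContext={"Database": "analytics_catalog"},          # placeholder database
    ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"},
)

query_id = execution["QueryExecutionId"]
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    print(f"Returned {len(rows) - 1} data rows")  # the first row is the header
```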
Leveraging Redshift Spectrum for Scalable Query Processing

Redshift Spectrum is an extension of Amazon's cloud data warehouse service, Redshift. It lets users run SQL queries directly on data stored in Amazon S3 without loading or converting the data first, so large amounts of structured and unstructured data can be analyzed from Redshift. The integration is seamless: point Redshift Spectrum at the S3 data lake, define a schema, and start querying with standard SQL. Traditional data warehouse solutions often require significant pre-processing and data movement before analysis, which not only adds complexity but also delays insight. Redshift Spectrum offers a more agile approach: you keep your data where it is, in Amazon S3, and bring the computing power to it. This eliminates the time-consuming ETL (extract, transform, load) step and opens the door to real-time analytics at scale. Furthermore, because you pay only for the queries you run, you can save significantly compared to traditional solutions where hardware and storage costs are a factor.

Several tactics can be used to maximize the benefits of Redshift Spectrum. First, arranging data in a columnar format increases efficiency, since it lets Redshift Spectrum read only the columns a query requires. Partitioning data on frequently filtered columns also enhances performance by reducing the amount of data that must be examined. Consider the size of the files stored in S3 as well: many small files create overhead, whereas very large files may not parallelize well, so striking the right balance is key. Another factor in cost-efficient querying is controlling the volume of data scanned per query: to minimize Redshift Spectrum charges, restrict the data scanned by using WHERE clauses to filter out unnecessary rows. Finally, continuously monitoring and analyzing query patterns helps pinpoint opportunities to improve data structures or query designs for better performance and lower cost.

Conclusion

In this article, we explored the intricacies of querying data stored on Amazon S3. We saw the significance of a strong data catalog and how Glue Crawler streamlines its creation and upkeep. We examined Amazon Athena, a tool that enables quick, serverless ad hoc querying. Finally, we discussed how Redshift Spectrum extends Amazon Redshift by allowing queries over S3 data, providing a strong alternative to conventional data warehouses. These tools are more than standalone units; they are components of a unified ecosystem that, combined, can create a robust framework for analyzing data.

By Satrajit Basu
GitHub Copilot Tutorial

This article describes the GitHub Copilot tool and the main guidelines and assumptions regarding its use in software development projects. The guidelines cover both the tool's configuration and its application in everyday work, and they assume the reader uses GitHub Copilot with IntelliJ IDEA (via a dedicated plugin).

GitHub Copilot: What Is It?

GitHub Copilot is an AI developer assistant that uses a generative AI model trained on all the programming languages available in GitHub repositories. The full description and documentation of the tool is available here. There are other similar tools on the market, such as OpenAI Codex, JetBrains AI Assistant, or Tabnine, but GitHub Copilot stands out due to the following features:

The largest and most diverse collection for training an AI model: GitHub repositories
An estimated usage share of currently approximately 40-50% (according to Abhay Mishra's article, based on undisclosed industry insights), though the market is very dynamic
Support for popular technologies: we have tested it with Java, Scala, Kotlin, Groovy, SQL, Spring, Dockerfile, OpenShift, and Bash
Very good integration with the JetBrains IntelliJ IDEA IDE
A low barrier to entry thanks to quick and easy configuration, general ease of use, clear documentation, and many usage examples on the internet
A wide range of functionality, including: suggestions while writing code; generating code from comments in natural language; taking existing code into account when generating a new snippet; creating unit tests; chat, which lets you ask questions about code, language, and technology and suggests corrections for simplifying the code; and a CLI that supports working in the console and creating Bash scripts

Our Goals

Our main goal for using GitHub Copilot was to improve the efficiency of writing code and its quality. In addition, we intended it to support and assist us in areas where programmers lack knowledge and experience. Here are the specific goals we wanted our development team to achieve with GitHub Copilot:

1. Accelerating development: generating code fragments; generating SQL queries; hints for creating and modifying OpenShift and Dockerfile configuration files; faster searching for solutions via the chat function, e.g., explanations of regular expressions or of library and framework mechanisms
2. Improving code quality: generating unit tests with edge cases, in both Java and Groovy; suggesting corrections and simplifications in our own code
3. Working with less frequently used technologies: explaining and generating code (including unit tests) in Scala and Kotlin; support when using "legacy" solutions like Activiti; support in creating and understanding configuration files
4. More efficient administrative work in the console using the CLI functions

Tool Limitations Guidelines

Since GitHub Copilot is based on generative AI, you must always remember that it may generate incorrect code or responses. Therefore, when using the tool, be aware of its potential limitations and apply the principle of limited trust and verification. The main limitations are listed below.

Limited scope of knowledge: The tool is trained on code found in GitHub repositories, so some problems, complex structures, languages, or data notations have poor representation in the training sets.
Dynamic development and features in the beta phase: The tool is developing very dynamically.
Patches and updates appear every week or every few weeks, which indicates that many elements of the tool are not yet working properly, and some functionalities, such as GitHub Copilot CLI, are still in beta.
Inaccurate code: The tool provider states that the generated code may not meet the user's expectations, may not solve the actual problem, and may contain errors.
Inaccurate chat responses: When using chat, the accuracy of the answer depends largely on how the question or command is formulated. The documentation says that "Copilot Chat is not designed to answer non-coding questions", so some answers, especially in areas not strictly related to code (design, etc.), may not be appropriate or even sensible.
Dangerous code: The training set (repositories) may also contain code that violates security and safety rules, such as API keys, network scanning, IP addresses, or code that overloads resources or causes memory leaks.

To minimize the negative impact of these limitations, you should always:

Check alternative suggestions (using Ctrl+[ and Ctrl+], etc.) and choose the ones that best suit the situation
Read the generated code and analyze its correctness
Test and run the code in pre-production environments, primarily locally and in the development environment
Submit the generated code to code review

Important: never deploy code generated by GitHub Copilot to production environments without performing the above checks.

Configuration Guidelines

In this section, we present basic information about the pricing plans (with the advantages and disadvantages of each option, seen from the perspective of our intended goals) and about personal account configuration (for both GitHub Copilot and the IntelliJ IDEA plugin).

Pricing Plans

GitHub Copilot offers three subscription plans with different functionality and cost. In our case, two plans were worth considering: Copilot Individual or Copilot Business. The Copilot Enterprise plan additionally offers chat access via the github.com website and summaries for pull requests, which were unimportant for our goals (though it may be different in your case). The main advantages and disadvantages of both plans are:

GitHub Copilot Individual. Advantages: lower cost at $10/month/user; offers the key functionality required to achieve the intended goals. Disadvantage: no organizational control over tool configuration and user access.
GitHub Copilot Business. Advantages: offers the key functionality required to achieve the intended goals; gives the organization control over tool configuration and user access. Disadvantage: higher cost at $19/month/user.

In our case, Copilot Business was the better option, especially because it gives the organization full control over configuration and over which developers on the team can access the tool. If you're working on your own, the Copilot Individual plan might be enough.

Account Configuration

You can configure GitHub Copilot when purchasing a subscription plan, and the settings can also be changed after activation in the organization's account settings on GitHub. At the account level, two parameters were key for our use case, described below.
Suggestions matching public code. Available options: Allowed and Blocked. Determines whether to show or block code suggestions that overlap by around 150 lines with public code. Recommended setting: Blocked. This reduces the risk of duplicating code from public repositories and thus the uncertainty about the copyright ownership of the code.
Allow GitHub to use my code snippets for product improvements. Available options: Yes and No. Determines whether GitHub, its affiliates, and third parties may use your code snippets to explore and improve GitHub Copilot suggestions, related product models, and features. Recommended setting: No. If you plan to use GitHub Copilot for commercial purposes, GitHub and its associated entities should not use your code, for copyright reasons.

A detailed description and instructions for changing the configuration options in your GitHub account are available here.

IntelliJ IDEA Plugin Configuration

To enable GitHub Copilot in the IntelliJ IDEA IDE, install the GitHub Copilot plugin from the JetBrains Marketplace; installation is done from the IDE in the plugin settings. After installation, log in to your GitHub account with your device code. Detailed instructions for installing and updating the plugin are available here. The GitHub Copilot plugin for IntelliJ IDEA lets you configure the following:

Automatic submission of suggestions
The way suggestions are displayed
Automatic plugin updates
Supported languages
Keyboard shortcuts

In our case, using the default plugin settings was recommended because they ensure good working comfort and match the existing tool documentation. Each user can change the configuration according to their own preferences.

(Screenshots in the original article: our GitHub Copilot plugin settings in IntelliJ IDEA, and our keymap settings for GitHub Copilot in IntelliJ IDEA.)

How To Use GitHub Copilot in IntelliJ

Here are some guidelines for the key functionalities that will help you use GitHub Copilot optimally.

Generating Application Code

When to use: creating classes; creating fields, methods, and constructors; writing code snippets inside methods.
How to use: by writing code and accepting the automatic suggestions (it is always worth checking the alternative suggestions with the Ctrl+] / Ctrl+[ keys); by writing concise and precise comments in natural English; or by using the chat function. The chat can generate a fragment of code in response to a query (see the examples in the section "How To Use GitHub Copilot Chat" below) and lets you insert it quickly using the Copy Code Block or Insert Code Block at Cursor buttons that appear next to code in the chat window.

Writing Unit Tests

When to use: creating new classes and methods that we want to cover with unit tests; covering existing classes and methods with unit tests.
How to use: by writing a comment in the test class. For example, if you write // Unit test in JUnit for CurrencyService, Copilot generates the test class (shown as a screenshot in the original article). You can generate individual test methods by describing the test case in a comment, and you can generate mocks in the test class the same way. Using the chat, you can select GitHub Copilot > Generate Test from the context menu, enter the /tests command, or write an instruction in natural language, e.g., Generate unit test for class CurrencyService.
In response, you receive a descriptive explanation of the test structure and the code of the entire test class (shown in the original article).

Generating SQL Queries and Stored Procedures

When to use: when writing DDL, DML, and DQL queries that will be used in the application; during analysis of data and data-related errors in the database; when writing scripts and stored procedures.
How to use: IMPORTANT: you must have a database connection configured in IntelliJ IDEA or DataGrip. Write queries and use the automatic suggestions, or write a comment; for example, writing -- get party data for account produces a generated query (shown in the original article).

Creating OpenShift Configuration or Other Configuration Files

When to use: creating or modifying configuration files; analyzing directives, their options and values, and configuration mechanisms.
How to use: by writing directives and using the automatic suggestions, or by using the chat. You can select the directive and choose GitHub Copilot > Explain This from the context menu, enter the /explain command, or write a query in natural language about the given configuration element.

Using the Bash Console

When to use: when trying to use obscure console commands; for an explanation of a command and its options; to find the right command for a task; when writing Bash scripts.
How to use: IMPORTANT: to use the CLI tool, install GitHub CLI with the gh-copilot extension according to the instructions. Currently, the tool offers two commands:

gh copilot suggest. Example: gh copilot suggest "find IP number in text file". Result: grep -E -o '([0-9]{1,3}\.){3}[0-9]{1,3}' <filename>
gh copilot explain. Example: gh copilot explain "curl -k". Result: curl is used to issue web requests, e.g., download web pages; -k or --insecure allows curl to perform insecure SSL connections and transfers

How To Use GitHub Copilot Chat

We have written a separate chapter for GitHub Copilot Chat, as there are several use cases worth discussing. Let's go through them individually and cover the specific guidelines for each case.

Creating New Functionalities

When to use: when you are looking for a solution to a problem, such as creating a website, a method that performs a specific task, or error handling for a given block of code, method, or class.
How to use: enter a query in natural English describing the functionality you are looking for. It should concern topics related to programming: code, frameworks and libraries, services, architecture, and so on. The original article shows an example for the query How to get currency exchange data?

Using Regular Expressions

When to use: when you need to create and verify a regular expression.
How to use: enter a query in natural English describing the pattern you are looking for. The example in the original article shows a generated method with an incorrect pattern, a query, and a response with an explanation and corrected code.

Finding Errors in the Code

When to use: when you create new classes or methods; when analyzing a class or method that causes errors.
How to use: you can select the code and choose GitHub Copilot > Fix This from the context menu, enter the /fix command, or write an instruction in natural English, e.g., Find possible errors in this class. You can narrow the command to a method name or error type.
For example, for a simple class, the chat explained the potential errors and generated code to handle them (shown in the original article).

Explanation of Existing Code

When to use: when you don't understand what exactly a module, class, method, piece of code, regular expression, etc., does; when you don't know the framework or library mechanism being used.
How to use: in a class or method, you can select GitHub Copilot > Explain This from the context menu, type the /explain command, or write a query in natural English about the problematic code element, e.g., Explain what this class is doing. The example in the original article presents an explanation of the class and its methods (the class generated in the bug-finding example).

Simplify Existing Code

When to use: when the code is complicated, difficult to understand, or unnecessarily extensive; when refactoring the code.
How to use: for a class, selected method, or code fragment, you can select GitHub Copilot > Simplify This from the context menu, type the /simplify command, or write a query in natural English. The original article shows an example of a simple refactoring of a class and its result.

Summary: A Powerful Tool, as Long as You're Cautious

As you can see, GitHub Copilot can be a powerful tool in a software developer's arsenal. It can speed up and simplify many processes and day-to-day tasks. However, as with everything related to generative AI, you can never fully trust the tool; the crucial rule is to always read, review, and test what it creates.

By Karol Świder
Data Migration With AWS DMS and Terraform IaC

"Data is the new oil" is a saying I often hear, and it couldn't be more accurate in today's highly interconnected world. Data migration is crucial for organizations worldwide, from startups aiming to scale rapidly to enterprises seeking to modernize their IT infrastructure. As a tech enthusiast, I've often found myself navigating the complexities of moving large volumes of data across different environments. A data migration that is not well planned or executed, whether it is a one-time event or ongoing replication, and that is performed manually, without automation scripts, or without adequate testing, can cause issues during the migration and increase delay or downtime. To take this challenge head-on, I've spoken with several technology leaders about easing data migration journeys and about how AWS DMS streamlines them. AWS DMS provides a platform for executing migrations effectively with minimal downtime. I've also realized that we can completely automate this process using Terraform IaC to trigger a migration from any supported source database to a target database: with Terraform, we can create the infrastructure required for the target nodes and the AWS DMS resources, which then complete the data migration automatically. In this blog, we'll dive deep into the intricacies of data migration using AWS DMS and Terraform IaC. We'll cover:

What is AWS Data Migration Service (AWS DMS)?
How to automate data migration using AWS DMS and Terraform IaC
Key benefits and features of AWS DMS

Let's get started!

1. What Is AWS DMS (Database Migration Service)?

AWS DMS (Database Migration Service) is a cloud-based tool that facilitates database migration to the AWS Cloud by replicating data from any supported source to any supported target. It also supports change data capture (CDC), which replicates data from source to target on an ongoing basis. (The original article includes an AWS DMS architectural overview diagram here.)

Use Cases of AWS DMS

AWS Database Migration Service supports many use cases, from like-for-like migrations to complex cross-platform transitions.

Homogeneous Data Migration

Homogeneous database migration moves data between identical or similar databases. This one-step process is straightforward because the schema structure and data types are consistent between the source and target databases. (Diagram: Homogeneous Database Migration.)

Heterogeneous Database Migration

Heterogeneous database migration involves transferring data between different database engines, such as Oracle to Amazon Aurora, Oracle to PostgreSQL, or SQL Server to MySQL. This process requires converting the source schema and code to match the target database. Using the AWS Schema Conversion Tool, the migration becomes a two-step procedure: schema conversion and data migration. Source schema and code conversion involves transforming tables, views, stored procedures, functions, data types, synonyms, and so on. Any objects that the AWS Schema Conversion Tool can't convert automatically are clearly marked for manual conversion so the migration can be completed.
DMS Schema Conversion Heterogeneous Database Migrations Prerequisites for AWS DMS The following are prerequisites for AWS DMS data migration: Access to source and target endpoints through firewall and security groups Source endpoint connection Target endpoint connection Replication instance Target schema or database CloudWatch event to trigger the Lambda function Lambda function to start the replication task Resource limit increase AWS DMS Components Before migrating with AWS DMS, let's understand its components. Replication Instance A replication instance is a managed Amazon EC2 instance that hosts and runs replication jobs. It connects to the source data store, reads and formats the data for the target, and loads it into the target data store. Replication Instance Source and Target Endpoints AWS DMS uses endpoints to connect to source and target databases, allowing it to migrate data from a source endpoint to a target endpoint. Supported Source Endpoints Include: Google Cloud for MySQL, Amazon RDS for PostgreSQL, Microsoft SQL Server, Oracle Database, Amazon DocumentDB, PostgreSQL, Microsoft Azure SQL Database, IBM DB2, Amazon Aurora with MySQL compatibility, MongoDB, Amazon RDS for Oracle, Amazon S3, Amazon RDS for MariaDB, Amazon RDS for Microsoft SQL Server, MySQL, Amazon RDS for MySQL, Amazon Aurora with PostgreSQL compatibility, MariaDB, and SAP Adaptive Server Enterprise (ASE). Supported Target Endpoints Include: PostgreSQL, SAP Adaptive Server Enterprise (ASE), Google Cloud for MySQL, IBM DB2, MySQL, Amazon RDS for Microsoft SQL Server, Oracle Database, Amazon RDS for MariaDB, Amazon Aurora with MySQL compatibility, MariaDB, Amazon S3, Amazon RDS for PostgreSQL, Microsoft SQL Server, Amazon DocumentDB, Microsoft Azure SQL Database, Amazon RDS for Oracle, MongoDB, Amazon Aurora with PostgreSQL compatibility, and Amazon RDS for MySQL. Replication Tasks Replication tasks facilitate smooth data transfer from a source endpoint to a target endpoint. This involves specifying the necessary tables and schemas for migration and any special processing requirements such as logging, control table data, and error handling. Creating a replication task is a crucial step before starting the migration; it includes defining the migration type, the source and target endpoints, and the replication instance. A replication task can use one of the following migration types: Full Load: Migrates existing data only. Full Load with CDC (Change Data Capture): Migrates existing data and continuously replicates changes. CDC Only (Change Data Capture): Continuously replicates only the changes in data. Validation Only: Focuses solely on data validation. These types lead to three main phases: Migration of Existing Data (Full Load): AWS DMS transfers data from the source tables to the target tables. Cached Changes Application: While the full load is in progress, changes to the tables being loaded are cached on the replication server. Once the full load for a table is complete, AWS DMS applies the cached changes. Ongoing Replication (Change Data Capture): Initially, a transaction backlog causes a lag between the source and target databases. Over time, this backlog is processed, achieving a steady migration flow. This phased approach ensures that AWS DMS guides the data migration methodically, maintaining data integrity and consistency.
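To tie these components together, the same objects can be created and started with the AWS CLI. The sketch below assumes the endpoint and replication instance ARNs are already exported as environment variables and that a table-mappings.json file defines the selection rules; all names are placeholders:

Shell
# Create a replication instance sized for the workload (placeholder identifier and class)
aws dms create-replication-instance \
  --replication-instance-identifier dms-ri-01 \
  --replication-instance-class dms.t3.medium \
  --allocated-storage 100

# Create a full-load-and-cdc replication task between previously created endpoints
aws dms create-replication-task \
  --replication-task-identifier db01-full-load-and-cdc \
  --source-endpoint-arn "$SOURCE_ENDPOINT_ARN" \
  --target-endpoint-arn "$TARGET_ENDPOINT_ARN" \
  --replication-instance-arn "$REPLICATION_INSTANCE_ARN" \
  --migration-type full-load-and-cdc \
  --table-mappings file://table-mappings.json

# Start the task once it reaches the Ready state
aws dms start-replication-task \
  --replication-task-arn "$REPLICATION_TASK_ARN" \
  --start-replication-task-type start-replication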
CloudWatch Events AWS CloudWatch EventBridge delivers notifications about AWS DMS events, such as replication task initiation/deletion and replication instance creation/removal. EventBridge receives these events and directs notifications based on predefined rules. Lambda Function We use an AWS Lambda function to initiate replication tasks. When an event signaling task creation occurs in AWS DMS, the Lambda function is automatically triggered by the configured EventBridge rules. Resource Limits In managing AWS Database Migration Service (DMS), we adhere to default resource quotas, which serve as soft limits. With assistance from AWS support tickets, these limits can be increased as needed to ensure optimal performance. Critical AWS DMS resource limits include: Endpoints per user account: 1000 (default) Endpoints per replication instance: 100 (default) Tasks per user account: 600 (default) Tasks per replication instance: 200 (default) Replication instances per user account: 60 (default) For example, to migrate 100 databases from an On-Prem MySQL source to RDS MySQL, we use the following calculation: Tasks per database: 1 Endpoints per database: 2 Endpoints per replication instance: 100 Total tasks per replication instance = Endpoints per replication instance / Endpoints per database = 100 / 2 = 50. This means we can migrate up to 50 databases per replication instance. Using two replication instances, we can migrate all 100 databases efficiently in one go. This approach exemplifies the strategic use of resource quotas for effective database migration. How To Automate Data Migration With Terraform IaC: Overview Terraform and DMS automate and secure data migration, simplifying the process while managing AWS infrastructure efficiently. Here's a step-by-step overview of this seamless and secure migration process: Step 1: Fetching Migration Database List Retrieve a list of databases to be migrated. Step 2: Database Creation (Homogeneous Migration) Create target schema or database structures to prepare for data transition in case of homogeneous data migrations. Step 3: Replication Subnet Group Creation Create replication subnet groups to ensure seamless network communication for data movement. Step 4: Source/Target Connection Endpoints Equip each database set for migration with source and target connection. Step 5: Replication Instance Creation Create replication instances to handle the data migration process. Step 6: Lambda Integration With Cloud Watch Events Integrate a CloudWatch event and Lambda function to initiate replication tasks. Step 7: Replication Task Creation and Assignment Create and assign replication tasks to replication instances, setting up the migration. Step 8: Migration Task Initiation Migration tasks are initiated for each database. Migration Process & Workflow Diagram Architecture Overview for Data Migration Automation AWS DMS with Terraform Infrastructure as Code (IAC) automates the data migration. The data migration automation process begins with the dynamic framework of Jenkins pipelines. This framework uses various input parameters to customize and tailor the migration process, offering flexibility and adaptability. Here's a detailed overview of the architecture: AWS DMS Architecture with Terraform IAC Step 1: Jenkins Pipeline Parameters The Jenkins pipeline for AWS DMS starts by defining essential input parameters, such as region and environment details, Terragrunt module specifics, and migration preferences. 
Key input parameters include: AWS_REGION: Populates the region list from the repository. APP_ENVIRONMENT: Populates the application environment list from the repository. TG_MODULE: Populates the Terragrunt module folder list from the repository. TG_ACTION: Allows users to select Terragrunt actions from plan, validate, and apply). TG_EXTRA_FLAGS: Users can pass Terragrunt more flags. FETCH_DBLIST: Determines the migration DB list generation type (AUTOMATIC and MANUAL). CUSTOM_DBLIST: SQL Server custom Database list for migration if FETCH_DBLIST is selected as MANUAL. MIGRATION_TYPE: Allows users to choose the DMS migration type (full-load, full-load-and-cdc, cdc). START_TASKS: Allows users to turn migration task execution on or off. TEAMS: MS Teams channel for build notifications. Step 2: Execution Stages Based on the input parameters, the pipeline progresses through distinct execution stages: Source Code Checkout for IAC: The pipeline begins by checking out the source code for IAC, establishing a solid foundation for the following steps. Migration Database List: Depending on the selected migration type, the pipeline automatically fetches the migration database list from the source instance or uses a manual list. Schema or Database Creation: The target instance is created by creating the necessary schema or database structures for data migration. Terraform/Terragrunt Execution: The pipeline executes Terraform or Terragrunt modules to facilitate the AWS DMS migration process. Notifications: Updates are sent via email or MS Teams throughout the migration process. Step 3: Automatic and Manual List Fetching Fetched migration database list automatically from the source instance using a shell script and keeping FETCH_DBLIST automatic. Alternatively, users can manually provide a selective list for migration. Step 4: Migration Types The Terraform/Terragrunt module initiates CDC, full-load-and-cdc, and full-load migrations based on the specified migration type in MIGRATION_TYPE. Step 5: Automation Control Initiate the migration task, either manually or automatically, with START_TASKS. Step 6: Credentials Management For security, retrieve database credentials from AWS Secrets Manager while executing DMS Terraform/Terragrunt modules. Step 7: Endpoint Creation Establish endpoints for target and source instances, facilitating seamless connection and data transfer. Step 8: Replication Instances Create replication instances based on the database count or quota limits. Step 9: CloudWatch Integration Configure AWS CloudWatch events to trigger a Lambda function after AWS DMS replication tasks are created. Step 10: Replication Task Configuration Create replication tasks for individual databases and assign them to available replication instances for optimized data transfer. Step 11: Task Automation Replication tasks automatically start using the Lambda function in the Ready State. Step 12: Monitoring Migration Use the AWS DMS Console for real-time monitoring of data migration progress, gaining insights into the migration journey. Step 13: Ongoing Changes Seamlessly replicate ongoing changes into the target instance after the migration, ensuring data consistency. Step 14: Automated Validation Automatically validate migrated data against source and target instances based on provided validation configurations to reinforce data integrity. Step 15: Completion and Configuration Ensure user migration and database configurations are completed post-validation. 
Step 16: Target Testing and Validation Update the application configuration to use the target instance for testing to ensure functionality. Step 17: Cutover Replication Execute cutover replication from the source instance after thorough testing, taking a final snapshot of the source instance to conclude the process. Key Features and Benefits of AWS DMS With Terraform AWS DMS with Terraform IAC offers several benefits: cost-efficiency, ease of use, minimized downtime, and robust replication. Cost Optimization AWS DMS Migration offers a cost-effective model as it costs as per compute resources and additional log storage. Ease of Use The migration process is simplified with no need for specific drivers or application installations and often no changes to the source database. One-click resource creation streamlines the entire migration journey. Continuous Replication and Minimal Downtime AWS DMS ensures continuous source database replication, even while operational, enabling minimal downtime and seamless database switching. Ongoing Replication Maintaining synchronization between source and target databases with ongoing replication tasks ensures data consistency. Diverse Source/Target Support AWS DMS supports migrations from like-to-like (e.g., MySQL to MySQL) to heterogeneous migrations (e.g., Oracle to Amazon Aurora) across SQL, NoSQL, and text-based targets. Database Consolidation AWS DMS with Terraform can easily consolidate multiple source databases into a single target database, which applies to homogeneous and heterogeneous migrations. Efficiency in Schema Conversion and Migration AWS DMS minimizes manual effort in tasks such as migrating users, stored procedures, triggers, and schema conversion while validating the target database against application functionality. Automated Provisioning With Terraform IAC Leverage Terraform for automated creation and destruction of AWS DMS replication tasks, ideal for managing migrations involving multiple databases. Automated Pipeline Integration Integrate seamlessly with CI/CD pipelines for efficient migration management, monitoring, and progress tracking. Conclusion This blog talks in detail about how the combination of AWS DMS and Terraform IAC can be used to automate data migration. The blog serves as a guide, exploring the synergy between these technologies and equipping businesses with the tools for optimized digital transformation.

By Sameer Danave
AWS: EC2 User Data vs. EC2 AMI

In this blog on AWS, I will do a comparison study among two EC2 initialization/configuration tools — User Data and AMI, which help in the configuration and management of EC2 instances. EC2 User Data EC2 User Data is a powerful feature of EC2 instances that allows you to automate tasks and customize your instances during the bootstrapping process. It’s a versatile tool that can be used to install software, configure instances, and even perform complex setup tasks. User Data refers to data that is provided by the user when launching an instance. This data is generally used to perform automated configuration tasks and bootstrap scripts when the instance boots for the first time. Purpose To automate configuration tasks and software installations when an instance is launched. Key Features Automation of Initial Configuration It can include scripts (e.g., shell scripts), commands, or software installation instructions. Runs on First Boot Executes only once during the initial boot (first start) of the instance unless specified otherwise. Use Cases Initialization Tasks Set up environment variables, download and install software packages, configure services, and more when the instance starts. One-Time Setup Run scripts that should only be executed once at the instance’s first boot. Dynamic Configurations Apply configurations that might change frequently and are specific to each instance launch. EC2 AMI An Amazon Machine Image (AMI) is a master image for the creation of EC2 instances. It is a template that contains a software configuration (operating system, application server, and applications) necessary to launch an EC2 instance. You can create your own AMI or use pre-built ones provided by AWS or AWS Marketplace vendors. Purpose To provide a consistent and repeatable environment for launching instances. Key Features Pre-Configured Environment Includes everything needed to boot the instance, including the operating system and installed applications. Reusable and Shareable Once created, an AMI can be used to launch multiple instances, shared with other AWS accounts, or even made public. Use Cases Base Images Create standardized base images with all necessary configurations and software pre-installed. Consistency Ensure that all instances launched from the same AMI have identical configurations. Faster Deployments Launch instances faster since the AMI already includes the required software and configurations. Key Differences Scripting vs. Pre-Configured User Data allows you to run a script when you launch an instance, automating tasks like installing software, writing files, or otherwise configuring the new instance. AMIs contain a snapshot of a configured instance, meaning all the software and settings are preserved. Dynamic Configuration vs. Quick Launch User Data is a flexible way to handle the instance configuration dynamically at the time of instance launch. Using an AMI that has software pre-installed can speed up instance deployment. Uniformity vs. Immutable With User Data, you can use a single AMI for all your instances and customize each instance on launch. AMIs are immutable, so each instance launched from the AMI has the same configuration. Late Binding vs. Early Binding Changes to User Data can be made at any time prior to instance launch, giving you more flexibility to adjust your instance’s behavior. Since the AMI is pre-configured, changes to the instance configuration must be made by creating a new AMI ONLY. Stateless vs. 
Stateful User Data is generally designed to be stateless, meaning the configuration is specified each time you launch a new instance and it is not saved with the instance. Once an AMI is created, it represents the saved state of an instance. This can include installed software, system settings, and even data. Resource Intensive vs. Resource Efficient With User Data, running complex scripts can be resource-intensive and can delay the time it takes for an instance to become fully operational. Since, in AMI, everything is pre-configured, fewer startup resources are needed. Size Limitation vs. No Size Limitation User Data is limited to 16KB. There are no specific size limitations for AMIs, other than the size of the EBS volume or instance storage. Security Sensitive data in User Data should be handled carefully as it’s visible in the EC2 console and through the API. AMIs can be encrypted, and access can be restricted to specific AWS accounts. However, once an AMI is launched, its settings and data are exposed to the account that owns the instance. Troubleshooting Errors in User Data scripts can sometimes be difficult to troubleshoot, especially if they prevent the instance from starting correctly. Errors in AMIs are easier to troubleshoot since you can start and stop instances, taking snapshots at various states for analysis. Commonalities Instance Initialization and Configuration Both User Data and AMIs are used to configure EC2 instances. User Data allows for dynamic script execution at boot time, while AMIs provide a snapshot of a pre-configured system state, including the operating system and installed applications. Automation Both tools enhance the automation capabilities of AWS EC2. User Data automates the process of setting up and configuring a new instance at launch, whereas AMIs automate the deployment of new instances by providing a consistent, repeatable template for instance creation. Scalability User Data and AMIs both support scalable deployment strategies. User Data can be used to configure instances differently based on their role or purpose as they are launched, adapting to scalable environments. AMIs allow for the rapid scaling of applications by launching multiple identical instances quickly and efficiently. Customization Both provide mechanisms for customizing EC2 instances. With User Data, users can write scripts that apply custom configurations every time an instance is launched. With AMIs, users can create a customized image that includes all desired configurations and software, which can be reused across multiple instance launches. Integration With AWS Services Both integrate seamlessly with other AWS services. For example, both can be utilized alongside AWS Auto Scaling to ensure that new instances are configured properly as they enter the service pool. They also work with AWS Elastic Load Balancing to distribute traffic to instances that are either launched from a custom AMI or configured via User Data. Security and Compliance Both can be configured to adhere to security standards and compliance requirements. For AMIs, security configurations, software patches, and compliance settings can be pre-applied. For User Data, security scripts and configurations can be executed at launch to meet specific security or compliance criteria. Version Control and Updates In practice, both User Data and AMIs can be version-controlled. For User Data, scripts can be maintained in source control repositories and updated as needed. 
For AMIs, new versions can be created following updates or changes, allowing for rollback capabilities and history tracking. Conclusion In essence, while User Data is suited for dynamic and specific configurations at instance launch, AMIs provide a way to standardize and expedite deployments across multiple instances. This is just an attempt to clear up the ambiguities between the two EC2 initialization/configuration tools, User Data and AMI. I hope you find this article helpful in understanding these two important EC2 configuration tools in AWS. Thank you for reading! Please don't forget to like and share, and feel free to share your thoughts in the comments section.

By PRAVEEN SUNDAR
Top 10 Essential Linux Commands

As a Linux administrator or even if you are a newbie who just started using Linux, having a good understanding of useful commands in troubleshooting network issues is paramount. We'll explore the top 10 essential Linux commands for diagnosing and resolving common network problems. Each command will be accompanied by real-world examples to illustrate its usage and effectiveness. 1. ping Example: ping google.com Shell test@ubuntu-server ~ % ping google.com -c 5 PING google.com (142.250.189.206): 56 data bytes 64 bytes from 142.250.189.206: icmp_seq=0 ttl=58 time=14.610 ms 64 bytes from 142.250.189.206: icmp_seq=1 ttl=58 time=18.005 ms 64 bytes from 142.250.189.206: icmp_seq=2 ttl=58 time=19.402 ms 64 bytes from 142.250.189.206: icmp_seq=3 ttl=58 time=22.450 ms 64 bytes from 142.250.189.206: icmp_seq=4 ttl=58 time=15.870 ms --- google.com ping statistics --- 5 packets transmitted, 5 packets received, 0.0% packet loss round-trip min/avg/max/stddev = 14.610/18.067/22.450/2.749 ms test@ubuntu-server ~ % Explanation ping uses ICMP protocol, where ICMP stands for internet control message protocol and ICMP is a network layer protocol used by network devices to communicate. ping helps in testing the reachability of the host and it will also help in finding the latency between the source and destination. 2. traceroute Example: traceroute google.com Shell test@ubuntu-server ~ % traceroute google.com traceroute to google.com (142.250.189.238), 64 hops max, 52 byte packets 1 10.0.0.1 (10.0.0.1) 6.482 ms 3.309 ms 3.685 ms 2 96.120.90.197 (96.120.90.197) 13.094 ms 10.617 ms 11.351 ms 3 po-301-1221-rur01.fremont.ca.sfba.comcast.net (68.86.248.153) 12.627 ms 11.240 ms 12.020 ms 4 ae-236-rar01.santaclara.ca.sfba.comcast.net (162.151.87.245) 18.902 ms 44.432 ms 18.269 ms 5 be-299-ar01.santaclara.ca.sfba.comcast.net (68.86.143.93) 14.826 ms 13.161 ms 12.814 ms 6 69.241.75.42 (69.241.75.42) 12.236 ms 12.302 ms 69.241.75.46 (69.241.75.46) 15.215 ms 7 * * * 8 142.251.65.166 (142.251.65.166) 21.878 ms 14.087 ms 209.85.243.112 (209.85.243.112) 14.252 ms 9 nuq04s39-in-f14.1e100.net (142.250.189.238) 13.666 ms 192.178.87.152 (192.178.87.152) 12.657 ms 13.170 ms test@ubuntu-server ~ % Explanation Traceroute shows the route packets take to reach a destination host. It displays the IP addresses of routers along the path and calculates the round-trip time (RTT) for each hop. Traceroute helps identify network congestion or routing issues. 3. netstat Example: netstat -tulpn Shell test@ubuntu-server ~ % netstat -tuln Active LOCAL (UNIX) domain sockets Address Type Recv-Q Send-Q Inode Conn Refs Nextref Addr aaf06ba76e4d0469 stream 0 0 0 aaf06ba76e4d03a1 0 0 /var/run/mDNSResponder aaf06ba76e4d03a1 stream 0 0 0 aaf06ba76e4d0469 0 0 aaf06ba76e4cd4c1 stream 0 0 0 aaf06ba76e4ccdb9 0 0 /var/run/mDNSResponder aaf06ba76e4cace9 stream 0 0 0 aaf06ba76e4c9e11 0 0 /var/run/mDNSResponder aaf06ba76e4d0b71 stream 0 0 0 aaf06ba76e4d0aa9 0 0 /var/run/mDNSResponder test@ubuntu-server ~ % Explanation Netstat displays network connections, routing tables, interface statistics, masquerade connections, and multicast memberships. It's useful for troubleshooting network connectivity, identifying open ports, and monitoring network performance. 4. 
ifconfig/ip Example: ifconfig or ifconfig <interface name> Shell test@ubuntu-server ~ % ifconfig en0 en0: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500 options=6460<TSO4,TSO6,CHANNEL_IO,PARTIAL_CSUM,ZEROINVERT_CSUM> ether 10:9f:41:ad:91:60 inet 10.0.0.24 netmask 0xffffff00 broadcast 10.0.0.255 inet6 fe80::870:c909:df17:7ed1%en0 prefixlen 64 secured scopeid 0xc inet6 2601:641:300:e710:14ef:e605:4c8d:7e09 prefixlen 64 autoconf secured inet6 2601:641:300:e710:d5ec:a0a0:cdbb:79a7 prefixlen 64 autoconf temporary inet6 2601:641:300:e710::6cfc prefixlen 64 dynamic nd6 options=201<PERFORMNUD,DAD> media: autoselect status: active test@ubuntu-server ~ % Explanation ifconfig and ip commands are used to view and configure network parameters. They provide information about the IP address, subnet mask, MAC address, and network status of each interface. 5. tcpdump Example:tcpdump -i en0 tcp port 80 Shell test@ubuntu-server ~ % tcpdump -i en0 tcp port 80 tcpdump: verbose output suppressed, use -v[v]... for full protocol decode listening on en0, link-type EN10MB (Ethernet), snapshot length 524288 bytes 0 packets captured 55 packets received by filter 0 packets dropped by kernel test@ubuntu-server ~ % Explanation Tcpdump is a packet analyzer that captures and displays network traffic in real-time. It's invaluable for troubleshooting network issues, analyzing packet contents, and identifying abnormal network behavior. Use tcpdump to inspect packets on specific interfaces or ports. 6. nslookup/dig Example: nslookup google.com or dig Shell test@ubuntu-server ~ % nslookup google.com Server: 2001:558:feed::1 Address: 2001:558:feed::1#53 Non-authoritative answer: Name: google.com Address: 172.217.12.110 test@ubuntu-server ~ % test@ubuntu-server ~ % dig google.com ; <<>> DiG 9.10.6 <<>> google.com ;; global options: +cmd ;; Got answer: ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 46600 ;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1 ;; OPT PSEUDOSECTION: ; EDNS: version: 0, flags:; udp: 512 ;; QUESTION SECTION: ;google.com. IN A ;; ANSWER SECTION: google.com. 164 IN A 142.250.189.206 ;; Query time: 20 msec ;; SERVER: 2001:558:feed::1#53(2001:558:feed::1) ;; WHEN: Mon Apr 15 22:55:35 PDT 2024 ;; MSG SIZE rcvd: 55 test@ubuntu-server ~ % Explanation nslookup and dig are DNS lookup tools used to query DNS servers for domain name resolution. They provide information about the IP address associated with a domain name and help diagnose DNS-related problems such as incorrect DNS configuration or server unavailability. 7. iptables/firewalld Example: iptables -L or firewall-cmd --list-all Shell test@ubuntu-server ~# iptables -L Chain INPUT (policy ACCEPT) target prot opt source destination Chain FORWARD (policy DROP) target prot opt source destination Chain OUTPUT (policy ACCEPT) target prot opt source destination test@ubuntu-server ~# Explanation iptables and firewalld are firewall management tools used to configure packet filtering and network address translation (NAT) rules. They control incoming and outgoing traffic and protect the system from unauthorized access. Use them to diagnose firewall-related issues and ensure proper traffic flow. 8. ss Example: ss -tulpn Shell test@ubuntu-server ~# Netid State Recv-Q Send-Q Local Address:Port Peer Address:Port udp UNCONN 0 0 *:161 *:* udp UNCONN 0 0 *:161 *:* test@ubuntu-server ~# Explanation ss is a utility to investigate sockets. 
It displays information about TCP, UDP, and UNIX domain sockets, including listening and established connections, connection state, and process IDs. ss is useful for troubleshooting socket-related problems and monitoring network activity. 9. arp Example: arp -a Shell test@ubuntu-server ~ % arp -a ? (10.0.0.1) at 80:da:c2:95:aa:f7 on en0 ifscope [ethernet] ? (10.0.0.57) at 1c:4d:66:bb:49:a on en0 ifscope [ethernet] ? (10.0.0.83) at 3a:4a:df:fe:66:58 on en0 ifscope [ethernet] ? (10.0.0.117) at 70:2a:d5:5a:cc:14 on en0 ifscope [ethernet] ? (10.0.0.127) at fe:e2:1c:4d:b3:f7 on en0 ifscope [ethernet] ? (10.0.0.132) at bc:d0:74:9a:51:85 on en0 ifscope [ethernet] ? (10.0.0.255) at ff:ff:ff:ff:ff:ff on en0 ifscope [ethernet] mdns.mcast.net (224.0.0.251) at 1:0:5e:0:0:fb on en0 ifscope permanent [ethernet] ? (239.255.255.250) at 1:0:5e:7f:ff:fa on en0 ifscope permanent [ethernet] test@ubuntu-server ~ % Explanation arp (Address Resolution Protocol) displays and modifies the IP-to-MAC address translation tables used by the kernel. It resolves IP addresses to MAC addresses and vice versa. arp is helpful for troubleshooting issues related to network device discovery and address resolution. 10. mtr Example: mtr Shell test.ubuntu.com (0.0.0.0) Tue Apr 16 14:46:40 2024 Keys: Help Display mode Restart statistics Order of fields quit Packets Ping Host Loss% Snt Last Avg Best Wrst StDev 1. 10.0.0.10 0.0% 143 0.8 9.4 0.7 58.6 15.2 2. 10.0.2.10 0.0% 143 0.8 9.4 0.7 58.6 15.2 3. 192.168.0.233 0.0% 143 0.8 9.4 0.7 58.6 15.2 4. 142.251.225.178 0.0% 143 0.8 9.4 0.7 58.6 15.2 5. 142.251.225.177 0.0% 143 0.8 9.4 0.7 58.6 15.2 Explanation mtr (My traceroute) combines the functionality of ping and traceroute into a single diagnostic tool. It continuously probes network paths between the host and a destination, displaying detailed statistics about packet loss, latency, and route changes. Mtr is ideal for diagnosing intermittent network problems and monitoring network performance over time. Mastering these commands comes in handy for troubleshooting network issues on Linux hosts.
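To show how several of these tools combine in practice, here is a minimal first-pass triage sequence; the hostname and interface name are placeholders and should be adapted to your environment:

Shell
#!/usr/bin/env bash
HOST=example.com   # placeholder target host
IFACE=eth0         # placeholder network interface

ping -c 3 "$HOST"            # basic reachability and latency
traceroute "$HOST"           # path and per-hop latency
ip addr show "$IFACE"        # local addressing on the interface
ss -tulpn                    # listening TCP/UDP sockets and owning processes
dig +short "$HOST"           # quick DNS resolution check
mtr --report -c 10 "$HOST"   # combined loss/latency report over 10 probes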

By Prashanth Ravula DZone Core CORE
Build a Time-Tracking App With ClickUp API Integration Using Openkoda

Is it possible to build a time-tracking app in just a few hours? It is, and in this article, I'll show you how! I’m a senior backend Java developer with 8 years of experience in building web applications. I will show you how satisfying and revolutionary it can be to save a lot of time on building my next one. The approach I use is as follows: I want to create a time-tracking application (I called it Timelog) that integrates with the ClickUp API. It offers a simple functionality that will be very useful here: creating time entries remotely. In order to save time, I will use some out-of-the-box functionalities that the Openkoda platform offers. These features are designed with developers in mind. Using them, I can skip building standard features that are used in every web application (over and over again). Instead, I can focus on the core business logic. I will use the following pre-built features for my application needs: Login/password authentication User and organization management Different user roles and privileges Email sender Logs overview Server-side code editor Web endpoints creator CRUDs generator Let’s get started! Timelog Application Overview Our sample internal application creates a small complex system that can then be easily extended both model-wise and with additional business logic or custom views. The main focus of the application is to: Store the data required to communicate with the ClickUp API. Assign users to their tickets. Post new time entries to the external API. To speed up the process of building the application, we relied on some of the out-of-the-box functionalities mentioned above. At this stage, we used the following ones: Data model builder (Form) - Allows us to define data structures without the need to recompile the application, with the ability to adjust the data schema on the fly Ready-to-use management functionalities - With this one, we can forget about developing things like authentication, security, and standard dashboard view. Server-side code editor - Used to develop a dedicated service responsible for ClickUp API integration, it is coded in JavaScript all within the Openkoda UI. WebEndpoint builder - Allows us to create a custom form handler that uses a server-side code service to post time tracking entry data to the ClickUp servers instead of storing it in our internal database Step 1: Setting Up the Architecture To implement the functionality described above and to store the required data, we designed a simple data model, consisting of the following five entities. ClickUpConfig, ClickUpUser, Ticket, and Assignment are designed to store the keys and IDs required for connections and messages sent to the ClickUp API. The last one, TimeEntry, is intended to take advantage of a ready-to-use HTML form (Thymeleaf fragment), saving a lot of time on its development. The following shows the detailed structure of a prepared data model for the Timelog ClickUp integration. 
ClickUpConfig apiKey - ClickUp API key teamId - ID of the space (workspace) in ClickUp to create time entries in ClickUpUser userId - Internal ID of a User clickUpUserId - ID of a user assigned to a workspace in ClickUp Ticket name - Internal name of the ticket clickUpTaskId - ID of a task in ClickUp to create time entries for Assignment userId - Internal ID of a User ticketId - Internal ID of a Ticket TimeEntry userId - Internal ID of a User ticketId - Internal ID of a ticket date - Date of a time entry durationHours - Time entry duration provided in hours durationMinutes - Time entry duration provided in minutes description - Short description for the created time entry We want to end up with five data tiles on the dashboard: Step 2: Integrating With ClickUp API We integrated our application with the ClickUp API, specifically using its endpoint to create time entries in ClickUp. To connect the Timelog app with our ClickUp workspace, we need to provide the API key. This can be done using either a personal API token or a token generated by creating an App in the ClickUp dashboard. For information on how to retrieve one of these, see the official ClickUp documentation. In order for our application to be able to create time entries in our ClickUp workspace, we need to provide some ClickUp IDs: teamId: This is the first ID value in the URL after accessing your workspace. userId: To check the user’s ClickUp ID (Member ID), go to Workspace -> Manage Users. On the Users list, select the user’s Settings and then Copy Member ID. taskId: The task ID is accessible in three places on the dashboard: the URL, the task modal, and the tasks list view. See the ClickUp Help Center for detailed instructions. You can recognize the task ID by its # prefix; we use the ID without the prefix. Step 3: Data Model Magic With Openkoda Openkoda uses the Byte Buddy library to dynamically build entity and repository classes for registered entities at runtime in our Spring Boot application. Here is a short snippet of entity class generation in Openkoda (the whole service class is available on their GitHub). Java dynamicType = new ByteBuddy() .with(SKIP_DEFAULTS) .subclass(OpenkodaEntity.class) .name(PACKAGE + name) .annotateType(entity) .annotateType(tableAnnotation) .defineConstructor(PUBLIC) .intercept(MethodCall .invoke(OpenkodaEntity.class.getDeclaredConstructor(Long.class)) .with((Object) null)); Openkoda provides a custom form builder syntax that defines the structure of an entity. This structure is then used to generate both entity and repository classes, as well as HTML representations of CRUD views such as a paginated table with all records, a settings form, and a simple read-only view. All five entities from the data model described earlier have been registered in the same way, only by using the form builder syntax. The form builder snippet for the Ticket entity is presented below. JavaScript a => a .text("name") .text("clickUpTaskId") The definition above results in an entity named Ticket with a set of default fields from OpenkodaEntity and two custom ones named “name” and “clickUpTaskId”.
The database table structure for dynamically generated Ticket entity is as follows: Markdown Table "public.dynamic_ticket" Column | Type | Collation | Nullable | Default ------------------+--------------------------+-----------+----------+----------------------- id | bigint | | not null | created_by | character varying(255) | | | created_by_id | bigint | | | created_on | timestamp with time zone | | | CURRENT_TIMESTAMP index_string | character varying(16300) | | | ''::character varying modified_by | character varying(255) | | | modified_by_id | bigint | | | organization_id | bigint | | | updated_on | timestamp with time zone | | | CURRENT_TIMESTAMP click_up_task_id | character varying(255) | | | name | character varying(255) | | | The last step of a successful entity registration is to refresh the Spring context so it recognizes the new repository beans and for Hibernate to acknowledge entities. It can be done by restarting the application from the Admin Panel (section Monitoring). Our final result is an auto-generated full CRUD for the Ticket entity. Auto-generated Ticket settings view: Auto-generated all Tickets list view: Step 4: Setting Up Server-Side Code as a Service We implemented ClickUp API integration using the Openkoda Server-Side Code keeping API calls logic separate as a service. It is possible to use the exported JS functions further in the logic of custom form view request handlers. Then we created a JavaScript service that delivers functions responsible for ClickUp API communication. Openkoda uses GraalVM to run any JS code fully on the backend server. Our ClickupAPI server-side code service has only one function (postCreateTimeEntry) which is needed to meet our Timelog application requirements. JavaScript export function postCreateTimeEntry(apiKey, teamId, duration, description, date, assignee, taskId) { let url = `https://api.clickup.com/api/v2/team/${teamId}/time_entries`; let timeEntryReq = { duration: duration, description: '[Openkoda Timelog] ' + description, billable: true, start: date, assignee: assignee, tid: taskId, }; let headers = {Authorization: apiKey}; return context.services.integrations.restPost(url, timeEntryReq, headers); } To use such a service later on in WebEndpoints, it is easy enough to follow the standard JS import expression import * as clickupAPI from 'clickupAPI';. Step 5: Building Time Entry Form With Custom GET/POST Handlers Here, we prepare the essential screen for our demo application: the time entry form which posts data to the ClickUp API. All is done in the Openkoda user interface by providing simple HTML content and some JS code snippets. The View The HTML fragment is as simple as the one posted below. We used a ready-to-use form Thymeleaf fragment (see form tag) and the rest of the code is a standard structure of a Thymeleaf template. HTML <!--DEFAULT CONTENT--> <!DOCTYPE html> <html xmlns:th="http://www.thymeleaf.org" xmlns:layout="http://www.ultraq.net.nz/thymeleaf/layout" lang="en" layout:decorate="~{${defaultLayout}"> <body> <div class="container"> <h1 layout:fragment="title"/> <div layout:fragment="content"> <form th:replace="~{generic-forms::generic-form(${TimeEntry}, 'TimeEntry', '', '', '', 'Time Entry', #{template.save}, true)}"></form> </div> </div> </body> </html> HTTP Handlers Once having a simple HTML code for the view, we need to provide the actual form object required for the generic form fragment (${TimeEntry}). 
We do it inside a GET endpoint as a first step, and after that, we set the currently logged user ID so there’s a default value selected when entering the time entry view. JavaScript flow .thenSet("TimeEntry", a => a.services.data.getForm("TimeEntry")) .then(a => a.model.get("TimeEntry").dto.set("userId", a.model.get("userEntityId"))) Lastly, the POST endpoint is registered to handle the actual POST request sent from the form view (HTML code presented above). It implements the scenario where a user enters the time entry form, provides the data, and then sends the data to the ClickUp server. The following POST endpoint JS code: Receives the form data. Reads the additional configurations from the internal database (like API key, team ID, or ClickUp user ID). Prepares the data to be sent. Triggers the clickupAPI service to communicate with the remote API. JavaScript import * as clickupAPI from 'clickupAPI'; flow .thenSet("clickUpConfig", a => a.services.data.getRepository("clickupConfig").search( (root, query, cb) => { let orgId = a.model.get("organizationEntityId") != null ? a.model.get("organizationEntityId") : -1; return cb.or(cb.isNull(root.get("organizationId")), cb.equal(root.get("organizationId"), orgId)); }).get(0) ) .thenSet("clickUpUser", a => a.services.data.getRepository("clickupUser").search( (root, query, cb) => { let userId = a.model.get("userEntityId") != null ? a.model.get("userEntityId") : -1; return cb.equal(root.get("userId"), userId); }) ) .thenSet("ticket", a => a.form.dto.get("ticketId") != null ? a.services.data.getRepository("ticket").findOne(a.form.dto.get("ticketId")) : null) .then(a => { let durationMs = (a.form.dto.get("durationHours") != null ? a.form.dto.get("durationHours") * 3600000 : 0) + (a.form.dto.get("durationMinutes") != null ? a.form.dto.get("durationMinutes") * 60000 : 0); return clickupAPI.postCreateTimeEntry( a.model.get("clickUpConfig").apiKey, a.model.get("clickUpConfig").teamId, durationMs, a.form.dto.get("description"), a.form.dto.get("date") != null ? (new Date(a.services.util.toString(a.form.dto.get("date")))).getTime() : Date.now().getTime(), a.model.get("clickUpUser").length ? a.model.get("clickUpUser").get(0).clickUpUserId : -1, a.model.get("ticket") != null ? a.model.get("ticket").clickUpTaskId : '') }) Step 6: Our Application Is Ready! This is it! I built a complex application that is capable of storing the data of users, assignments to their tickets, and any properties required for ClickUp API connection. It provides a Time Entry form that covers ticket selection, date, duration, and description inputs of a single time entry and sends the data from the form straight to the integrated API. Not to forget about all of the pre-built functionalities available in Openkoda like authentication, user accounts management, logs overview, etc. As a result, the total time to create the Timelog application was only a few hours. What I have built is just a simple app with one main functionality. But there are many ways to extend it, e.g., by adding new structures to the data model, by developing more of the ClickUp API integration, or by creating more complex screens like the calendar view below. If you follow almost exactly the same scenario as I presented in this case, you will be able to build any other simple (or not) business application, saving time on repetitive and boring features and focusing on the core business requirements. 
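As a quick manual check of the integration, the same ClickUp endpoint that the postCreateTimeEntry service function calls can also be exercised directly with curl; the token and IDs below are placeholders, and the payload mirrors the fields used in the service:

Shell
API_KEY="pk_placeholder_token"   # personal API token (placeholder)
TEAM_ID="1234567"                # ClickUp workspace (team) ID (placeholder)
CLICKUP_USER_ID="2345678"        # ClickUp member ID (placeholder)
TASK_ID="abc123"                 # task ID without the leading '#' (placeholder)

curl -s -X POST "https://api.clickup.com/api/v2/team/${TEAM_ID}/time_entries" \
  -H "Authorization: ${API_KEY}" \
  -H "Content-Type: application/json" \
  -d "{
        \"duration\": 1800000,
        \"description\": \"[Openkoda Timelog] manual test entry\",
        \"billable\": true,
        \"start\": $(date +%s000),
        \"assignee\": ${CLICKUP_USER_ID},
        \"tid\": \"${TASK_ID}\"
      }"

If the request succeeds, the new time entry should appear on the task in ClickUp, just as entries created through the Timelog form do.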
I can think of several applications that could be built in the same way, such as a legal document management system, a real estate application, a travel agency system, just to name a few. As an experienced software engineer, I always enjoy implementing new ideas and seeing the results quickly. In this case, that is all I did. I spent the least amount of time creating a fully functional application tailored to my needs and skipped the monotonous work. The .zip package with all code and configuration files are available on my GitHub.

By Martyna Szczepanska
7 Linux Commands and Tips to Improve Productivity

1. Use "&&" to Link Two or More Commands Use “&&” to link two or more commands when you want the previous command to be succeeded before the next command. If you use “;” then it would still run the next command after “;” even if the command before “;” failed. So you would have to wait and run each command one by one. However, using "&&" ensures that the next command will only run if the preceding command finishes successfully. This allows you to add commands without waiting, move on to the next task, and check later. If the last command ran, it indicates that all previous commands ran successfully. Example: Shell ls /path/to/file.txt && cp /path/to/file.txt /backup/ The above example ensures that the previous command runs successfully and that the file "file.txt" exists. If the file doesn't exist, the second command after "&&" won't run and won't attempt to copy it. 2. Use “grep” With -A and -B Options One common use of the "grep" command is to identify specific errors from log files. However, using it with the -A and -B options provides additional context within a single command, and it displays lines after and before the searched text, which enhances visibility into related content. Example: Shell % grep -A 2 "java.io.IOException" logfile.txt java.io.IOException: Permission denied (open /path/to/file.txt) at java.io.FileOutputStream.<init>(FileOutputStream.java:53) at com.pkg.TestClass.writeFile(TestClass.java:258) Using grep with -A here will also show 2 lines after the “java.io.IOException” was found from the logfile.txt. Similarly, Shell grep "Ramesh" -B 3 rank-file.txt Name: John Wright, Rank: 23 Name: David Ross, Rank: 45 Name: Peter Taylor, Rank: 68 Name Ramesh Kumar, Rank: 36 Here, grep with -B option will also show 3 lines before the “Ramesh” was found from the rank-file.txt 3. Use “>” to Create an Empty File Just write > and then the filename to create an empty file with the name provided after > Example: Shell >my-file.txt It will create an empty file with "my-file.txt" name in the current directory. 4. Use “rsync” for Backups "rsync" is a useful command for regular backups as it saves time by transferring only the differences between the source and destination. This feature is especially beneficial when creating backups over a network. Example: Shell rsync -avz /path/to/source_directory/ user@remotehost:/path/to/destination_directory/ 5. Use Tab Completion Using tab completion as a habit is faster than manually selecting filenames and pressing Enter. Typing the initial letters of filenames and utilizing Tab completion streamlines the process and is more efficient. 6. Use “man” Pages Instead of reaching the web to find the usage of a command, a quick way would be to use the “man” command to find out the manual of that command. This approach not only saves time but also ensures accuracy, as command options can vary based on the installed version. By accessing the manual directly, you get precise details tailored to your existing version. Example: Shell man ps It will get the manual page for the “ps” command 7. Create Scripts For repetitive tasks, create small shell scripts that chain commands and perform actions based on conditions. This saves time and reduces risks in complex operations. Conclusion In conclusion, becoming familiar with these Linux commands and tips can significantly boost productivity and streamline workflow on the command line. 
By using techniques like command chaining, context-aware searching, efficient file management, and automation through scripts, users can save time, reduce errors, and optimize their Linux experience.
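To tie a few of these tips together, below is a minimal sketch of the kind of small script suggested in tip 7, combining the && chaining from tip 1 with the rsync backup from tip 4; all paths and the remote host are placeholders:

Shell
#!/usr/bin/env bash
# backup.sh - a small helper for a repetitive backup task (paths and host are placeholders)
SRC="/path/to/source_directory/"
DEST="user@remotehost:/path/to/destination_directory/"
LOG="$HOME/backup.log"

# Run the backup only if the source exists, and log success only if rsync succeeded
[ -d "$SRC" ] && rsync -avz "$SRC" "$DEST" && echo "$(date): backup OK" >> "$LOG"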

By Rahul Chaturvedi
Deep Dive Into Terraform Provider Debugging With Delve

Debugging Terraform providers is crucial for ensuring the reliability and functionality of infrastructure deployments. Terraform providers, written in languages like Go, can have complex logic that requires careful debugging when issues arise. One powerful tool for debugging Terraform providers is Delve, a debugger for the Go programming language. Delve allows developers to set breakpoints, inspect variables, and step through code, making it easier to identify and resolve bugs. In this blog, we will explore how to use Delve effectively for debugging Terraform providers. Set Up Delve for Debugging a Terraform Provider Shell # For Linux sudo apt-get install -y delve # For macOS brew install delve Refer here for more details on the installation. Debug Terraform Provider Using VS Code Follow the steps below to debug the provider: Download the provider code. We will use the IBM Cloud Terraform Provider for this debugging example. Update the provider’s main.go code as shown below to support debugging: Go package main import ( "flag" "log" "github.com/IBM-Cloud/terraform-provider-ibm/ibm/provider" "github.com/IBM-Cloud/terraform-provider-ibm/version" "github.com/hashicorp/terraform-plugin-sdk/v2/plugin" ) func main() { var debug bool flag.BoolVar(&debug, "debug", true, "Set to true to enable debugging mode using delve") flag.Parse() opts := &plugin.ServeOpts{ Debug: debug, ProviderAddr: "registry.terraform.io/IBM-Cloud/ibm", ProviderFunc: provider.Provider, } log.Println("IBM Cloud Provider version", version.Version) plugin.Serve(opts) } Launch VS Code in debug mode. Refer here if you are new to debugging in VS Code. Create the launch.json using the configuration below. JSON { "version": "0.2.0", "configurations": [ { "name": "Debug Terraform Provider IBM with Delve", "type": "go", "request": "launch", "mode": "debug", "program": "${workspaceFolder}", "internalConsoleOptions": "openOnSessionStart", "args": [ "-debug" ] } ] } In VS Code, click “Start Debugging”. Starting the debugger launches the provider in debug mode. To allow the Terraform CLI to attach to the debugger, the console prints the environment variable TF_REATTACH_PROVIDERS. Copy this value from the console. Set it as an environment variable in the terminal running the Terraform code. Now, in the VS Code instance where the provider code is running in debug mode, open the Go code and set breakpoints. To learn more about breakpoints in VS Code, refer here. Execute 'terraform plan' followed by 'terraform apply' and notice that the Terraform provider breakpoint is triggered as part of the terraform apply execution. This helps to debug the Terraform execution and understand the behavior of the provider code for the particular inputs supplied in Terraform. Debug Terraform Provider Using DLV Command Line Follow the steps below to debug the provider using the command line. To learn more about the dlv command-line commands, refer here. Follow steps 1 and 2 mentioned in Debug Terraform Provider Using VS Code. In the terminal, navigate to the provider Go code and issue go build -gcflags="all=-N -l" to compile the code. To execute the precompiled Terraform provider binary and begin a debug session, run dlv exec --accept-multiclient --continue --headless <path to the binary> -- -debug where the build file is present. For the IBM Cloud Terraform provider, use dlv exec --accept-multiclient --continue --headless ./terraform-provider-ibm -- -debug In another terminal, where the Terraform code will be run, set TF_REATTACH_PROVIDERS as an environment variable.
Notice the “API server” details in the above command output. In another (third) terminal connect to the DLV server and start issuing the DLV client commands Set the breakpoint using the break command Now we are set to debug the Terraform provider when Terraform scripts are executed. Issue continue in the DLV client terminal to continue until the breakpoints are set. Now execute the terraform plan and terraform apply to notice the client waiting on the breakpoint. Use DLV CLI commands to stepin / stepout / continue the execution. This provides a way to debug the terraform provider from the command line. Remote Debugging and CI/CD Pipeline Debugging Following are the extensions to the debugging using the dlv command line tool. Remote Debugging Remote debugging allows you to debug a Terraform provider running on a remote machine or environment. Debugging in CI/CD Pipelines Debugging in CI/CD pipelines involves setting up your pipeline to run Delve and attach to your Terraform provider for debugging. This can be challenging due to the ephemeral nature of CI/CD environments. One approach is to use conditional logic in your pipeline configuration to only enable debugging when a specific environment variable is set. For example, you can use the following script in your pipeline configuration to start Delve and attach to your Terraform provider – YAML - name: Debug Terraform Provider if: env(DEBUG) == 'true' run: | dlv debug --headless --listen=:2345 --api-version=2 & sleep 5 # Wait for Delve to start export TF_LOG=TRACE terraform init terraform apply Best Practices for Effective Debugging With Delve Here are some best practices for effective debugging with Delve, along with tips for improving efficiency and minimizing downtime: Use version control: Always work with version-controlled code. This allows you to easily revert changes if debugging introduces new issues. Start small: Begin debugging with a minimal, reproducible test case. This helps isolate the problem and reduces the complexity of debugging. Understand the code: Familiarize yourself with the codebase before debugging. Knowing the code structure and expected behavior can speed up the debugging process. Use logging: Add logging statements to your code to track the flow of execution and the values of important variables. This can provide valuable insights during debugging. Use breakpoints wisely: Set breakpoints strategically at critical points in your code. Too many breakpoints can slow down the debugging process. Inspect variables: Use the print (p) command in Delve to inspect the values of variables. This can help you understand the state of your program at different points in time. Use conditional breakpoints: Use conditional breakpoints to break execution only when certain conditions are met. This can help you focus on specific scenarios or issues. Use stack traces: Use the stack command in Delve to view the call stack. This can help you understand the sequence of function calls leading to an issue. Use goroutine debugging: If your code uses goroutines, use Delve's goroutine debugging features to track down issues related to concurrency. Automate debugging: If you're debugging in a CI/CD pipeline, automate the process as much as possible to minimize downtime and speed up resolution. By following these best practices, you can improve the efficiency of your debugging process and minimize downtime caused by issues in your code. 
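To make these practices concrete, a typical interactive session against the headless Delve server started earlier might look roughly like the following; the port is whatever the server printed, and the file, line, and variable names are purely illustrative:

Shell
# Connect a dlv client to the headless server
dlv connect 127.0.0.1:2345

# Commands issued at the (dlv) prompt - all are standard Delve commands
(dlv) break ibm/service/resource/resource_instance.go:120   # hypothetical file and line
(dlv) condition 1 name == "my-instance"                     # conditional breakpoint on breakpoint 1
(dlv) continue                                              # run until the breakpoint is hit
(dlv) print name                                            # inspect a variable
(dlv) stack                                                 # view the call stack
(dlv) goroutines                                            # list goroutines
(dlv) next                                                  # step over the current line
(dlv) continue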
Conclusion In conclusion, mastering the art of debugging Terraform providers with Delve is a valuable skill that can significantly improve the reliability and performance of your infrastructure deployments. By setting up Delve for debugging, exploring advanced techniques like remote debugging and CI/CD pipeline debugging, and following best practices for effective debugging, you can effectively troubleshoot issues in your Terraform provider code. Debugging is not just about fixing bugs; it's also about understanding your code better and improving its overall quality. Dive deep into Terraform provider debugging with Delve, and empower yourself to build a more robust and efficient infrastructure with Terraform.

By Josephine Eskaline Joyce DZone Core CORE
Using Spring AI With AI/LLMs to Query Relational Databases

The AIDocumentLibraryChat project has been extended to support questions for searching relational databases. The user can input a question, and an embedding search then finds the relevant database tables and columns to answer it. The AI/LLM then gets the database schemas of the relevant tables and, based on the found tables and columns, generates a SQL query that answers the question with a result table. Dataset and Metadata The open-source dataset that is used has 6 tables with relations to each other. It contains data about museums and works of art. To get useful queries from the questions, the dataset has to be supplied with metadata, and that metadata has to be turned into embeddings. To enable the AI/LLM to find the needed tables and columns, it needs to know their names and descriptions. For all data tables, like the museum table, metadata is stored in the column_metadata and table_metadata tables. Their data can be found in the files column_metadata.csv and table_metadata.csv. They contain a unique ID, the name, the description, etc. of the table or column. That description is used to create the embeddings that the question embeddings are compared with. The quality of the description makes a big difference in the results because the embedding is more precise with a better description. Providing synonyms is one option to improve the quality. The table metadata also contains the schema of the table, so that only the relevant table schemas are added to the AI/LLM prompt. Embeddings To store the embeddings in PostgreSQL, the vector extension is used. The embeddings can be created with the OpenAI endpoint or with the ONNX library that is provided by Spring AI. Three types of embeddings are created: Tabledescription embeddings Columndescription embeddings Rowcolumn embeddings The Tabledescription embeddings have a vector based on the table description, and the embedding has the tablename, the datatype = table, and the metadata id in the metadata. The Columndescription embeddings have a vector based on the column description, and the embedding has the tablename, the dataname with the column name, the datatype = column, and the metadata id in the metadata. The Rowcolumn embeddings have a vector based on the content of a row's column value. That is used for values like the style or subject of an artwork, so that these values can be used in the question. The metadata has the datatype = row, the column name as dataname, the tablename, and the metadata id. Implement the Search The search has 3 steps: Retrieve the embeddings Create the prompt Execute the query and return the result Retrieve the Embeddings To read the embeddings from the PostgreSQL database with the vector extension, Spring AI uses the VectorStore class in the DocumentVSRepositoryBean: Java @Override public List<Document> retrieve(String query, DataType dataType) { return this.vectorStore.similaritySearch( SearchRequest.query(query).withFilterExpression( new Filter.Expression(ExpressionType.EQ, new Key(MetaData.DATATYPE), new Value(dataType.toString())))); } The VectorStore provides a similarity search for the user's query. The query is turned into an embedding, and with the FilterExpression for the datatype in the metadata, the matching results are returned.
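For readers curious about what this similarity search translates to at the database level, a roughly equivalent query can be issued directly with psql. The table name vector_store, its column layout, the 'datatype' metadata key, and the cosine-distance operator reflect Spring AI pgvector defaults and should be treated as assumptions; one stored embedding stands in for the question embedding, since real query vectors are produced by the embedding model at runtime:

Shell
# Assumes the default Spring AI pgvector table layout (id, content, metadata, embedding)
psql "$DATABASE_URL" -c "
  SELECT content, metadata
  FROM vector_store
  WHERE metadata->>'datatype' = 'table'
  ORDER BY embedding <=> (SELECT embedding FROM vector_store LIMIT 1)
  LIMIT 4;
"

In the application itself, this is hidden behind the VectorStore abstraction shown above, so the raw SQL is only useful for ad-hoc inspection of the stored embeddings.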
The TableService class uses the repository in the retrieveEmbeddings method:

Java

private EmbeddingContainer retrieveEmbeddings(SearchDto searchDto) {
  var tableDocuments = this.documentVsRepository.retrieve(
    searchDto.getSearchString(), MetaData.DataType.TABLE,
    searchDto.getResultAmount());
  var columnDocuments = this.documentVsRepository.retrieve(
    searchDto.getSearchString(), MetaData.DataType.COLUMN,
    searchDto.getResultAmount());
  List<String> rowSearchStrs = new ArrayList<>();
  if(searchDto.getSearchString().split("[ -.;,]").length > 5) {
    var tokens = List.of(searchDto.getSearchString().split("[ -.;,]"));
    for(int i = 0;i<tokens.size();i = i+3) {
      rowSearchStrs.add(tokens.size() <= i + 3 ? "" :
        tokens.subList(i, tokens.size() >= i +6 ? i+6 : tokens.size())
          .stream().collect(Collectors.joining(" ")));
    }
  }
  var rowDocuments = rowSearchStrs.stream().filter(myStr -> !myStr.isBlank())
    .flatMap(myStr -> this.documentVsRepository.retrieve(myStr,
      MetaData.DataType.ROW, searchDto.getResultAmount()).stream())
    .toList();
  return new EmbeddingContainer(tableDocuments, columnDocuments, rowDocuments);
}

First, documentVsRepository is used to retrieve the documents with the embeddings for the tables and columns based on the search string of the user. Then, the search string is split into chunks of 6 words to search for the documents with the row embeddings. The row embeddings are just one word, and to get a low distance, the query string has to be short; otherwise, the distance grows due to all the other words in the query. The chunks are then used to retrieve the row documents with their embeddings.

Create the Prompt

The prompt is created in the TableService class with the createPrompt method:

Java

private Prompt createPrompt(SearchDto searchDto, EmbeddingContainer documentContainer) {
  final Float minRowDistance = documentContainer.rowDocuments().stream()
    .map(myDoc -> (Float) myDoc.getMetadata().getOrDefault(MetaData.DISTANCE, 1.0f))
    .sorted().findFirst().orElse(1.0f);
  LOGGER.info("MinRowDistance: {}", minRowDistance);
  var sortedRowDocs = documentContainer.rowDocuments().stream()
    .sorted(this.compareDistance()).toList();
  var tableColumnNames = this.createTableColumnNames(documentContainer);
  List<TableNameSchema> tableRecords = this.tableMetadataRepository
    .findByTableNameIn(tableColumnNames.tableNames()).stream()
    .map(tableMetaData -> new TableNameSchema(tableMetaData.getTableName(),
      tableMetaData.getTableDdl())).collect(Collectors.toList());
  final AtomicReference<String> joinColumn = new AtomicReference<String>("");
  final AtomicReference<String> joinTable = new AtomicReference<String>("");
  final AtomicReference<String> columnValue = new AtomicReference<String>("");
  sortedRowDocs.stream().filter(myDoc -> minRowDistance <= MAX_ROW_DISTANCE)
    .filter(myRowDoc -> tableRecords.stream()
      .filter(myRecord -> myRecord.name().equals(myRowDoc.getMetadata()
        .get(MetaData.TABLE_NAME))).findFirst().isEmpty())
    .findFirst().ifPresent(myRowDoc -> {
      joinTable.set(((String) myRowDoc.getMetadata().get(MetaData.TABLE_NAME)));
      joinColumn.set(((String) myRowDoc.getMetadata().get(MetaData.DATANAME)));
      tableColumnNames.columnNames().add(((String) myRowDoc.getMetadata()
        .get(MetaData.DATANAME)));
      columnValue.set(myRowDoc.getContent());
      this.tableMetadataRepository.findByTableNameIn(
        List.of(((String) myRowDoc.getMetadata().get(MetaData.TABLE_NAME))))
        .stream().map(myTableMetadata -> new TableNameSchema(
          myTableMetadata.getTableName(), myTableMetadata.getTableDdl()))
        .findFirst().ifPresent(myRecord -> tableRecords.add(myRecord));
    });
  var messages = createMessages(searchDto, minRowDistance, tableColumnNames,
    tableRecords, joinColumn, joinTable, columnValue);
  Prompt prompt = new Prompt(messages);
  return prompt;
}

First, the minimum distance of the rowDocuments is extracted. Then a list of row documents sorted by distance is created. The method createTableColumnNames(...) creates the tableColumnNames record that contains a set of column names and a list of table names. The record is created by first filtering for the 3 tables with the lowest distances and then filtering for the columns of those tables with the lowest distances. Then the tableRecords are created by mapping the table names to their schema DDL strings with the TableMetadataRepository. The sorted row documents are then filtered with MAX_ROW_DISTANCE, and the values joinColumn, joinTable, and columnValue are set. The TableMetadataRepository is then used to create a TableNameSchema for that table and add it to the tableRecords.

Now the placeholders in systemPrompt and in the optional columnMatch can be set:

Java

private final String systemPrompt = """
    ...
    Include these columns in the query: {columns} \n
    Only use the following tables: {schemas};\n
    %s \n
    """;

private final String columnMatch = """
    Join this column: {joinColumn} of this table: {joinTable} where the column has this value: {columnValue}\n
    """;

The method createMessages(...) gets the set of columns to replace the {columns} placeholder. It gets the tableRecords to replace the {schemas} placeholder with the DDLs of the tables. If the row distance was beneath the threshold, the columnMatch property is added at the string placeholder %s, and the placeholders {joinColumn}, {joinTable}, and {columnValue} are replaced. With the information about the required columns, the schemas of the tables containing those columns, and the optional join for row matches, the AI/LLM is able to create a sensible SQL query.

Execute Query and Return Result

The SQL query is created with the createQuery(...) method and executed in searchTables(...):

Java

public SqlRowSet searchTables(SearchDto searchDto) {
  EmbeddingContainer documentContainer = this.retrieveEmbeddings(searchDto);
  Prompt prompt = createPrompt(searchDto, documentContainer);
  String sqlQuery = createQuery(prompt);
  LOGGER.info("Sql query: {}", sqlQuery);
  SqlRowSet rowSet = this.jdbcTemplate.queryForRowSet(sqlQuery);
  return rowSet;
}

First, the methods that prepare the data and create the SQL query are called, and then queryForRowSet(...) is used to execute the query on the database. The SqlRowSet is returned. The TableMapper class uses the map(...) method to turn the result into the TableSearchDto class:

Java

public TableSearchDto map(SqlRowSet rowSet, String question) {
  List<Map<String, String>> result = new ArrayList<>();
  while (rowSet.next()) {
    final AtomicInteger atomicIndex = new AtomicInteger(1);
    Map<String, String> myRow = List.of(rowSet
      .getMetaData().getColumnNames()).stream()
      .map(myCol -> Map.entry(
        this.createPropertyName(myCol, rowSet, atomicIndex),
        Optional.ofNullable(rowSet.getObject(atomicIndex.get()))
          .map(myOb -> myOb.toString()).orElse("")))
      .peek(x -> atomicIndex.set(atomicIndex.get() + 1))
      .collect(Collectors.toMap(myEntry -> myEntry.getKey(),
        myEntry -> myEntry.getValue()));
    result.add(myRow);
  }
  return new TableSearchDto(question, result, 100);
}

First, the result list for the row maps is created. Then, rowSet is iterated row by row to create a map with the column names as keys and the column values as values. This enables returning a flexible number of columns with their results.
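The keys get an index prefix; the JSON example in the frontend section below shows keys like 1_name and 2_name. A minimal sketch of a helper in the spirit of createPropertyName(...) (not the project's actual implementation, and with a simplified signature) could look like this:

Java

import java.util.concurrent.atomic.AtomicInteger;

// Hedged sketch, not the project's actual createPropertyName(...) code:
// prefixes the column name with its 1-based column index so that duplicate
// column names stay distinct in the result map (producing keys like "1_name").
public class PropertyNameSketch {
  static String indexedPropertyName(String columnName, AtomicInteger atomicIndex) {
    return atomicIndex.get() + "_" + columnName;
  }
}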
The createPropertyName(...) method adds the index integer to the map key to support duplicate key names.

Summary Backend

Spring AI supports creating prompts with a flexible number of placeholders very well. Creating the embeddings and querying the vector table is also well supported. Getting reasonable query results requires metadata for the columns and tables. Creating good metadata is an effort that scales linearly with the number of columns and tables. Implementing the embeddings for columns that need them is an additional effort. The result is that an AI/LLM like OpenAI or Ollama with the "sqlcoder:70b-alpha-q6_K" model can answer questions like: "Show the artwork name and the name of the museum that has the style Realism and the subject of Portraits." The AI/LLM can, within boundaries, answer natural-language questions that have some fit with the metadata. The number of embeddings needed is too big for a free OpenAI account, and "sqlcoder:70b-alpha-q6_K" is the smallest model with reasonable results. AI/LLMs offer a new way to interact with relational databases. Before starting a project to provide a natural-language interface for a database, the effort and the expected results have to be considered. The AI/LLM can help with questions of small to middle complexity, and the user should have some knowledge about the database.

Frontend

The returned result of the backend is a list of maps with column names as keys and column values as values. The number of returned map entries is unknown; because of that, the table that displays the result has to support a flexible number of columns. An example JSON result looks like this:

JSON

{"question":"...","resultList":[{"1_name":"Portrait of Margaret in Skating Costume","2_name":"Philadelphia Museum of Art"},{"1_name":"Portrait of Mary Adeline Williams","2_name":"Philadelphia Museum of Art"},{"1_name":"Portrait of a Little Girl","2_name":"Philadelphia Museum of Art"}],"resultAmount":100}

The resultList property contains a JavaScript array of objects with property keys and values. To be able to display the column names and values in an Angular Material table component, these properties are used:

TypeScript

protected columnData: Map<string, string>[] = [];
protected columnNames = new Set<string>();

The method getColumnNames(...) of table-search.component.ts is used to turn the JSON result into these properties:

TypeScript

private getColumnNames(tableSearch: TableSearch): Set<string> {
  const result = new Set<string>();
  this.columnData = [];
  const myList = !tableSearch?.resultList ? [] : tableSearch.resultList;
  myList.forEach((value) => {
    const myMap = new Map<string, string>();
    Object.entries(value).forEach((entry) => {
      result.add(entry[0]);
      myMap.set(entry[0], entry[1]);
    });
    this.columnData.push(myMap);
  });
  return result;
}

First, the result set is created and the columnData property is set to an empty array. Then, myList is created and iterated with forEach(...). For each object in the resultList, a new Map is created. For each property of the object, a new entry is created with the property name as the key and the property value as the value. The entry is set on the map, and the property name is added to the result set. The completed map is pushed onto the columnData array, and the result set is returned and assigned to the columnNames property. Then a set of column names is available in columnNames, and a map of column name to column value is available in columnData.
The template table-search.component.html contains the Material table:

HTML

@if(searchResult && searchResult.resultList?.length) {
<table mat-table [dataSource]="columnData">
  <ng-container *ngFor="let disCol of columnNames" matColumnDef="{{ disCol }}">
    <th mat-header-cell *matHeaderCellDef>{{ disCol }}</th>
    <td mat-cell *matCellDef="let element">{{ element.get(disCol) }}</td>
  </ng-container>
  <tr mat-header-row *matHeaderRowDef="columnNames"></tr>
  <tr mat-row *matRowDef="let row; columns: columnNames"></tr>
</table>
}

First, searchResult is checked for existence and for objects in the resultList. Then, the table is created with the columnData maps as its data source. The table header row is set with <tr mat-header-row *matHeaderRowDef="columnNames"></tr> to contain the columnNames. The table rows and columns are defined with <tr mat-row *matRowDef="let row; columns: columnNames"></tr>. The cells are created by iterating the columnNames like this: <ng-container *ngFor="let disCol of columnNames" matColumnDef="{{ disCol }}">. The header cells are created like this: <th mat-header-cell *matHeaderCellDef>{{ disCol }}</th>. The table cells are created like this: <td mat-cell *matCellDef="let element">{{ element.get(disCol) }}</td>. element is the map of the current columnData array element, and the cell value is retrieved with element.get(disCol).

Summary Frontend

The new Angular syntax makes the templates more readable. The Angular Material table component is more flexible than expected and supports an unknown number of columns very well.

Conclusion

Questioning a database with the help of an AI/LLM takes some effort for the metadata and requires the users to have a rough idea of what the database contains. AI/LLMs are not a natural fit for query creation because SQL queries require correctness. A pretty large model was needed to get the required query correctness, and GPU acceleration is required for productive use. A well-designed UI where the user can drag and drop the columns of the tables into the result table might be a good alternative for the requirements; Angular Material components support drag and drop very well. Before starting such a project, the customer should make an informed decision on which alternative fits the requirements best.

By Sven Loesekann
Why and How To Integrate Elastic APM in Apache JMeter

The Advantages of Elastic APM for Observing the Tested Environment

My first use of the Elastic Application Performance Monitoring (Elastic APM) solution coincided with microservice-based projects in 2019 for which I was responsible for performance testing. At that time (2019), the first versions of Elastic APM were released. I was attracted by the easy installation of agents, the numerous protocols supported by the Java agent (see Elastic supported technologies), including the Apache HttpClient used in JMeter, the other supported languages (Go, .NET, Node.js, PHP, Python, Ruby), and the quality of the APM dashboards in Kibana. I found the information displayed in the Kibana APM dashboards to be relevant and not too verbose. The Java agent monitoring is simple but displays essential information on the machine's OS and JVM. The open-source aspect and the free availability of the tool's main functions were also decisive. I have since generalized the use of the Elastic APM solution in performance environments for all projects. With Elastic APM, I have the timelines of the different calls and exchanges between web services, the SQL queries executed, the exchange of messages via JMS, and monitoring. I also have quick access to errors or exceptions thrown in Java applications.

Why Integrate Elastic APM in Apache JMeter

By adding Java APM agents to web applications, we get the called services' timelines in the Kibana dashboards. However, we mainly remain at the level of REST API calls, because we do not have the notion of a page. For example, page PAGE01 will make the following API calls:

  • /rest/service1
  • /rest/service2
  • /rest/service3

On another page, PAGE02 will make the following calls:

  • /rest/service2
  • /rest/service4
  • /rest/service5
  • /rest/service6

The third page, PAGE03, will make the following calls:

  • /rest/service1
  • /rest/service2
  • /rest/service4

In this example, service2 is called on 3 different pages and service4 on 2 pages. If we look in the Kibana dashboard for service2, we will find the union of the calls corresponding to the 3 pages, but we still don't have the notion of a page. We cannot answer "In this page, what is the breakdown of time across the different REST calls?", even though, for a user of the application, page response time is what matters. The goal of the jmeter-elastic-apm tool is to add the notion of a page, which already exists in JMeter in the Transaction Controller. This starts in JMeter by creating an APM transaction and then propagating this transaction identifier (traceparent) with the Elastic agent to the HTTP REST requests to web services, because the APM agent recognizes the Apache HttpClient library and can instrument it. In the HTTP request, the APM agent adds the identifier of the APM transaction to the header of the HTTP request. The headers added are traceparent and elastic-apm-traceparent. We start from the notion of the page in JMeter (Transaction Controller) and go to the HTTP calls of the web application (gestdoc) hosted in Tomcat. In the case of an application composed of multiple web services, we will see in the timeline the different web services called over HTTP(S) or JMS and the time spent in each web service. This is an example of a technical architecture for a performance test with Apache JMeter and the Elastic APM Agent to test a web application hosted in Apache Tomcat.

How the jmeter-elastic-apm Tool Works

jmeter-elastic-apm adds Groovy code before a JMeter Transaction Controller to create an APM transaction before a page.
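Conceptually, the begin/end samplers that the tool generates wrap each page in calls to the Elastic APM public API. The following Java sketch only illustrates that lifecycle (the class and method names are assumptions; the tool itself generates Groovy JSR223 samplers, simplified versions of which are shown below):

Java

// Illustrative sketch only, not the tool's generated code: the Elastic APM
// public API lifecycle that the "begin"/"end" samplers wrap around one page.
import co.elastic.apm.api.ElasticApm;
import co.elastic.apm.api.Scope;
import co.elastic.apm.api.Transaction;

public class PageTransactionSketch {

  // pageRequests stands in for the HTTP samplers inside the Transaction Controller.
  public void runPage(String pageName, Runnable pageRequests) {
    Transaction transaction = ElasticApm.startTransaction();  // "begin" sampler
    try (Scope scope = transaction.activate()) {               // make it the active transaction
      transaction.setName(pageName);                           // JMeter Transaction Controller name
      pageRequests.run();                                      // the agent adds traceparent to HttpClient calls
    } catch (RuntimeException e) {
      transaction.captureException(e);
      throw e;
    } finally {
      transaction.end();                                       // "end" sampler
    }
  }
}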
In the JMeter Transaction Controller, we find HTTP samplers that make REST HTTP(S) calls to the services. The Elastic APM Agent automatically adds a new traceparent header containing the identifier of the APM transaction because it recognizes the Apache HttpClient of the HTTP sampler. The Groovy code then terminates the APM transaction to indicate the end of the page. The jmeter-elastic-apm tool automates the addition of this Groovy code before and after the JMeter Transaction Controller. The tool is open source on GitHub (see the link in the Conclusion section of this article).

This JMeter script is simple, with 3 pages in 3 JMeter Transaction Controllers. After launching the jmeter-elastic-apm ADD action, the JMeter Transaction Controllers are surrounded by Groovy code to create an APM transaction before the JMeter Transaction Controller and close the APM transaction after it.

In the "groovy begin transaction apm" sampler, the Groovy code calls the Elastic APM API (simplified version):

Groovy

Transaction transaction = ElasticApm.startTransaction();
Scope scope = transaction.activate();
transaction.setName(transactionName); // contains the JMeter Transaction Controller name

In the "groovy end transaction apm" sampler, the Groovy code calls the Elastic APM API (simplified version):

Groovy

transaction.end();

Configuring Apache JMeter With the Elastic APM Agent and the APM Library

Start Apache JMeter With the Elastic APM Agent and the Elastic APM API Library

1. Declare the Elastic APM Agent (use the URL to find the APM Agent): Add the Elastic APM Agent somewhere in the filesystem (it could be in <JMETER_HOME>\lib, but this is not mandatory). In <JMETER_HOME>\bin, modify jmeter.bat or setenv.bat and add the Elastic APM configuration like so:

Shell

set APM_SERVICE_NAME=yourServiceName
set APM_ENVIRONMENT=yourEnvironment
set APM_SERVER_URL=http://apm_host:8200

set JVM_ARGS=-javaagent:<PATH_TO_AGENT_APM_JAR>\elastic-apm-agent-<version>.jar -Delastic.apm.service_name=%APM_SERVICE_NAME% -Delastic.apm.environment=%APM_ENVIRONMENT% -Delastic.apm.server_urls=%APM_SERVER_URL%

2. Add the Elastic APM API library (use the URL to find the APM library): Add the library as <JMETER_HOME>\lib\apm-agent-api-<version>.jar. This library is used by the JSR223 Groovy code.

Recommendations on the Impact of Adding Elastic APM in JMeter

The APM Agent will intercept and modify all HTTP sampler calls, and this information will be stored in Elasticsearch. It is preferable to voluntarily disable the HTTP requests for static elements (images, CSS, JavaScript, fonts, etc.), which can generate a large number of requests that are not very useful for analyzing the timeline. In the case of heavy load testing, it is recommended to change the elastic.apm.transaction_sample_rate parameter to record only part of the calls so as not to saturate the APM Server and Elasticsearch. This elastic.apm.transaction_sample_rate parameter can be declared in <JMETER_HOME>\jmeter.bat or setenv.bat, but also in a JSR223 sampler with a short piece of Groovy code in a setUp thread group. This Groovy code records only 50% of the samples:

Groovy

import co.elastic.apm.api.ElasticApm;

// update elastic.apm.transaction_sample_rate
ElasticApm.setConfig("transaction_sample_rate", "0.5");

Conclusion

The jmeter-elastic-apm tool allows you to easily integrate the Elastic APM solution into JMeter and add the notion of a page to the timelines of the Kibana APM dashboards.
Elastic APM + Apache JMeter is an excellent solution for understanding how the environment behaves during a performance test, with simple monitoring, quality dashboards, time-breakdown timelines across the different distributed application layers, and the display of exceptions in web services. Over time, the Elastic APM solution only gets better. I strongly recommend it in a performance-testing context, of course, but it also has many advantages in a development environment used by developers or an integration environment used by functional or technical testers.

Links

  • Command-line tool: jmeter-elastic-apm
  • JMeter plugin: elastic-apm-jmeter-plugin
  • Elastic APM guides: APM Guide or Application performance monitoring (APM)

By Vincent DABURON

Top Tools Experts


Bartłomiej Żyliński

Software Engineer,
SoftwareMill

I'm a Software Engineer with industry experience in designing and implementing complex applications and systems, mostly where it's not visible to users - at the backend. I'm a self-taught developer and a hands-on learner, constantly working towards expanding my knowledge further. I contribute to several open source projects, my main focus being sttp (where you can see my contributions on the project's GitHub). I appreciate the exchange of technical know-how - which is expressed by my various publications found on Medium and DZone, and appearances at top tech conferences and meetups, including Devoxx Belgium. I enjoy exploring topics that combine software engineering and mathematics. In my free time, I like to read a good book.

Abhishek Gupta

Principal Developer Advocate,
AWS

I mostly work on open-source technologies, including distributed data systems, Kubernetes, and Go.

Yitaek Hwang

Software Engineer,
NYDIG

The Latest Tools Topics

Master AWS IAM Role Configuration With Terraform
When you mix what AWS IAM can do with how Terraform lets you manage infrastructure through code, setting up secure and effective roles becomes simpler.
July 9, 2024
by Rom Carmel
· 784 Views · 2 Likes
Mastering Serverless Debugging
Discover effective strategies for debugging serverless in general and AWS Lambda. Serverless can be painful, but not a bottomless pit of despair.
July 8, 2024
by Shai Almog
· 1,033 Views · 2 Likes
Writing a Simple Pulumi Provider for Airbyte
Explore a simple example of writing a Pulumi provider for Airbyte. Instead of using the official Terraform provider, implement a Pulumi provider in Python.
July 8, 2024
by Carlo Scarioni
· 1,331 Views · 1 Like
AWS: Metric Filter vs. Subscription Filter
In this blog on AWS, let’s do a comparison study between two filter tools available with Amazon CloudWatch Logs — Metric Filter and Subscription Filter.
July 8, 2024
by PRAVEEN SUNDAR
· 1,172 Views · 1 Like
The Rise of Kubernetes: Reshaping the Future of Application Development
Kubernetes has become essential for modern app development. Learn how it's evolving to support AI/ML workloads and changing the developer landscape.
July 8, 2024
by Tom Smith
· 1,649 Views · 2 Likes
Enhance IaC Security With Mend Scans
Learn to incorporate Mend into your IaC workflows, improve infrastructure security posture, reduce the risk of misconfigurations, and ensure compliance.
July 5, 2024
by Vidyasagar (Sarath Chandra) Machupalli FBCS
· 2,313 Views · 3 Likes
Essential Monitoring Tools, Troubleshooting Techniques, and Best Practices for Atlassian Tools Administrators
This article explores leveraging various monitoring tools to identify, diagnose, and resolve issues in these essential development and collaboration platforms.
July 5, 2024
by Prashanth Ravula
· 2,720 Views · 2 Likes
Linting Excellence: How Black, isort, and Ruff Elevate Python Code Quality
In this article, explore Black, isort, and Ruff to streamline Python code quality checks and ensure consistent coding standards.
July 4, 2024
by Prince Bose
· 1,819 Views · 4 Likes
Data Integration Technology Maturity Curve 2024-2030
When it comes to data integration, some people may wonder what there is to discuss: isn't it just ETL? Learn more in this post.
July 3, 2024
by William Guo
· 3,013 Views · 2 Likes
Mastering Distributed Caching on AWS: Strategies, Services, and Best Practices
Distributed caching on AWS enhances app performance and scalability. AWS provides ElastiCache (Redis, Memcached) and DAX for implementation.
July 3, 2024
by Raghava Dittakavi
· 2,487 Views · 2 Likes
Trigger Salesforce Assignment Rules and Send Notifications From MuleSoft
This blog describes the process of triggering Salesforce lead or case assignment rules and sending email notifications to owners when creating records from MuleSoft.
July 2, 2024
by Ujala Kumar Yadav
· 1,394 Views · 2 Likes
How to Configure Custom Metrics in AWS Elastic Beanstalk Using Memory Metrics Example
By default, CloudWatch does not provide any memory metrics, but by a simple configuration, it's possible to add them to the monitoring dashboard.
July 2, 2024
by Alexander Sharov
· 2,295 Views · 1 Like
Performance and Scalability Analysis of Redis and Memcached
This article benchmarks Redis and Memcached, popular in-memory data stores, to help decision-makers choose the best solution for their needs.
July 2, 2024
by RAHUL CHANDEL
· 3,135 Views · 2 Likes
GBase 8a Implementation Guide: Resource Assessment
The storage space requirements for a GBase cluster are calculated based on the data volume, the choice of compression algorithm, and the number of cluster replicas.
July 1, 2024
by Cong Li
· 2,202 Views · 1 Like
A Look Into Netflix System Architecture
Netflix Architecture is designed to efficiently and reliably provide content to millions of consumers at once. Here's a breakdown of its characteristics and components.
July 1, 2024
by Rahul Shivalkar
· 5,718 Views · 7 Likes
High Availability and Disaster Recovery (HADR) in SQL Server on AWS
We will guide you through the process of configuring HADR for SQL Server on AWS, providing practical code examples for setting up and recovering a SQL Server database.
July 1, 2024
by Vijay Panwar
· 3,434 Views · 1 Like
Terraform Tips for Efficient Infrastructure Management
Securely manage your state files, use reusable modules, organize your code, and integrate automation to elevate your Terraform infrastructure management.
July 1, 2024
by Mariusz Michalowski
· 3,768 Views · 1 Like
Use AWS Generative AI CDK Constructs To Speed up App Development
In this blog, we will use the AWS Generative AI Constructs Library to deploy a complete RAG application using multiple components.
July 1, 2024
by Abhishek Gupta
· 5,026 Views · 1 Like
Explainable AI: Seven Tools and Techniques for Model Interpretability
Struggling to understand complex AI models? This blog dives into seven powerful XAI tools like LIME, SHAP, and Integrated Gradients. Perfect for software developers!
July 1, 2024
by Sameer Danave
· 3,016 Views · 2 Likes
Implementing Real-Time Credit Card Fraud Detection With Apache Flink on AWS
Real-time fraud detection systems are essential for identifying and preventing fraudulent transactions as they occur. Apache Flink is useful in this scenario.
July 1, 2024
by Harsh Daiya
· 3,336 Views · 1 Like
