
Azure Data Engineer Interview Questions

100 Interview Questions for Freshers and 1-2 Years Experienced Candidates

1. What is Microsoft Azure?

Microsoft Azure is a cloud computing platform that provides both hardware and software, delivered as managed services so that users can access them on demand.

2. List the data masking features Azure has.

When it comes to data security, dynamic data masking plays several vital roles: it limits the exposure of sensitive data to a specific set of users. Some of its features are:

  • It’s available for Azure SQL Database, Azure SQL Managed Instance, and Azure Synapse Analytics.
  • It can be carried out as a security policy on all the different SQL databases across the Azure subscription.
  • The levels of masking can be controlled per the users’ needs.

3. What is meant by a Polybase?

PolyBase is used to optimize data ingestion into PDW (Parallel Data Warehouse) and to support T-SQL. It lets developers query external data transparently from supported data stores, regardless of the storage architecture of the external data store.

4. Define reserved capacity in Azure.

Microsoft has included a reserved capacity option in Azure storage to optimize costs. The reserved storage gives its customers a fixed amount of capacity during the reservation period on the Azure cloud. 

5. What is meant by the Azure Data Factory?

Azure Data Factory is a cloud-based integration service that lets users build data-driven workflows within the cloud to arrange and automate data movement and transformation. Using Azure Data Factory, you can:

  • Develop and schedule data-driven workflows that can take data from different data stores.
  • Process and transform data with the help of computing services such as HDInsight Hadoop, Spark, Azure Data Lake Analytics, and Azure Machine Learning. 


6. What do you mean by blob storage in Azure?

It is a service that lets users store massive amounts of unstructured object data, such as binary data or text. It can even be used to expose data publicly or to store application data privately. Blob storage is commonly used for:

  • Providing images or documents to a browser directly
  • Audio and video streaming
  • Data storage for backup and restore disaster recovery
  • Data storage for analysis using an on-premises or Azure-hosted service
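As a quick illustration, here is a minimal sketch of uploading and downloading a blob with the azure-storage-blob Python SDK; the connection string, container, and blob names are placeholder assumptions.

from azure.storage.blob import BlobServiceClient

# Placeholder connection string and names for illustration only.
service = BlobServiceClient.from_connection_string("<storage-connection-string>")
blob = service.get_blob_client(container="documents", blob="reports/summary.txt")

# Upload unstructured data (text here, but it could be images, video, backups, etc.)
blob.upload_blob(b"hello from blob storage", overwrite=True)

# Download it back, e.g. for analysis by an on-premises or Azure-hosted service
content = blob.download_blob().readall()
print(content)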

7. Define the steps involved in creating the ETL process in Azure Data Factory.

The steps involved in creating the ETL process in Azure Data Factory are:

  • In the SQL Server Database, create a Linked Service for the source data store
  • For the destination data store, build a Linked Service that is the Azure Data Lake Store
  • For Data Saving purposes, create a dataset
  • Build the pipeline and then add the copy activity
  • Plan the pipeline by attaching a trigger
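A rough Python sketch of these steps using the azure-mgmt-datafactory SDK is shown below. The resource group, factory, and resource names are hypothetical, a blob-to-blob copy is used for brevity (for a SQL Server source or Data Lake sink you would pick the corresponding linked service and dataset types), and the exact model classes and required arguments vary with the SDK version; this only mirrors the shape of Microsoft's Python quickstart.

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    LinkedServiceResource, AzureStorageLinkedService,
    DatasetResource, AzureBlobDataset, LinkedServiceReference,
    PipelineResource, CopyActivity, DatasetReference, BlobSource, BlobSink,
)

rg, factory = "my-rg", "my-data-factory"   # hypothetical names
adf = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Steps 1-2: linked services for the source and destination stores
ls = LinkedServiceResource(properties=AzureStorageLinkedService(
    connection_string="<storage-connection-string>"))
adf.linked_services.create_or_update(rg, factory, "StorageLS", ls)

# Step 3: datasets describing the data to read and write
ls_ref = LinkedServiceReference(type="LinkedServiceReference", reference_name="StorageLS")
src = DatasetResource(properties=AzureBlobDataset(
    linked_service_name=ls_ref, folder_path="input", file_name="data.csv"))
adf.datasets.create_or_update(rg, factory, "SourceDS", src)
snk = DatasetResource(properties=AzureBlobDataset(
    linked_service_name=ls_ref, folder_path="staging"))
adf.datasets.create_or_update(rg, factory, "SinkDS", snk)

# Step 4: pipeline containing a copy activity
copy = CopyActivity(
    name="CopyData", source=BlobSource(), sink=BlobSink(),
    inputs=[DatasetReference(type="DatasetReference", reference_name="SourceDS")],
    outputs=[DatasetReference(type="DatasetReference", reference_name="SinkDS")])
adf.pipelines.create_or_update(rg, factory, "DemoPipeline",
                               PipelineResource(activities=[copy]))

# Step 5: attach a trigger, or start an on-demand run as shown here
adf.pipelines.create_run(rg, factory, "DemoPipeline", parameters={})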

8. Define serverless database computing in Azure.

In a typical computing model, the program code resides either on the client side or on a server. Serverless computing, however, follows a stateless code model: the code does not require any user-managed infrastructure.

Users pay only for the compute resources the code uses during the brief period in which it executes, which makes the model very cost-effective.
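As an illustration of this pay-per-execution model, here is a minimal HTTP-triggered Azure Function using the Python v1 programming model (the trigger binding itself would be declared in a function.json file, not shown here).

import azure.functions as func

def main(req: func.HttpRequest) -> func.HttpResponse:
    # No servers to manage: the platform allocates compute only while this code runs,
    # and billing is based on execution time and memory consumed.
    name = req.params.get("name", "world")
    return func.HttpResponse(f"Hello, {name}!", status_code=200)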

9. Explain the top-level concepts of Azure Data Factory.

  1. Pipeline

A pipeline acts as a carrier for the various processes taking place; every individual process is known as an activity.

  2. Activities

Activities represent the processing steps in a pipeline. A pipeline can have one or more activities, and an activity can be anything from querying a dataset to transferring a dataset from one source to another.

  3. Datasets

Simply put, a dataset is a structure that holds the data.

  4. Linked Services

Linked services store the connection information needed to connect to an external source.

10. What is Data Engineering?

Data engineering focuses on the application of data collection and analysis. The information gathered from numerous sources is merely raw data.

Data engineering helps transform this unusable data into useful information. In a nutshell, it is the process of transforming, cleansing, profiling, and aggregating large data sets.



11. What is Azure Synapse analytics?

Azure Synapse is an enterprise analytics service that accelerates time to insight across data warehouses and big data systems.

Azure Synapse combines the best of the SQL (Structured Query Language) technologies used in enterprise data warehousing, Spark technologies used for big data, Pipelines for data integration and ETL/ELT, and deep integration with other Azure services such as Power BI, Cosmos DB, and Azure ML.

12. Why is the Azure data factory needed?

The amount of data generated today is vast, and it comes from many different sources. When we move this data to the cloud, a few things must be taken care of.

Data can arrive in any form, because the various sources transfer or channel it in different ways and in different formats.

When we bring this data to the cloud or to a particular store, we need to make sure it is well managed: the data must be transformed and the unnecessary parts removed. As far as moving the data is concerned, we need to make sure it is picked up from the different sources, brought to one common place, stored, and, if required, transformed into something more meaningful.

A traditional data warehouse can also do this, but it has certain disadvantages. Sometimes we are forced to build custom applications that handle each of these processes individually, which is time-consuming, and integrating all these sources is a huge pain.

Azure Data Factory helps orchestrate this complete process in a more manageable and organized way.

13. What is Azure Synapse Runtime?

Apache Spark pools in Azure Synapse use runtimes to tie together essential component versions, Azure Synapse optimizations, packages, and connectors with a specific Apache Spark version.

These runtimes will be upgraded periodically to include new improvements, features, and patches. 

These runtimes have the following advantages:

  • Faster session startup times.
  • Tested compatibility with specific Apache Spark versions.
  • Access to popular, compatible connectors and open-source packages.

 

14. What is SerDe in Hive?

Serializer/Deserializer is popularly known as SerDe. Hive uses the SerDe interface for IO (input/output). The interface handles both serialization and deserialization, and it also interprets the results of serialization as individual fields for processing.

The Deserializer turns a record into a Hive-compatible Java object. The Serializer now turns this Java object into an HDFS (Hadoop Distributed File System) -compatible format. The storage role is then taken over by HDFS. Anyone can create their SerDe for their data format.

15. Explain the architecture of Azure Synapse Analytics.

It is designed to process massive amounts of data with hundreds of millions of rows in a table.

Azure Synapse Analytics processes complex queries and returns the results within seconds, even with massive data, because Synapse SQL runs on a Massively Parallel Processing (MPP) architecture that distributes data processing across multiple nodes.

Applications connect to a control node as a point of entry to the Synapse Analytics MPP engine. On receiving the Synapse SQL query, the control node breaks it down into MPP-optimized format.

Further, the individual operations are forwarded to the compute nodes that can perform the operations in parallel, resulting in much better query performance.


16. What are the various windowing functions in Azure Stream Analytics?

A window in Azure Stream Analytics refers to a block of time-stamped event data that enables users to perform various statistical operations on the event data.

Four types of windowing functions are available to partition and analyze a window in Azure Stream Analytics:

  • Tumbling Window: The data stream is segmented into distinct, fixed-length, non-overlapping time segments.
  • Hopping Window: Hopping windows hop forward in time by a fixed period, so the data segments can overlap.
  • Sliding Window: Unlike tumbling and hopping windows, aggregation occurs every time a new event arrives.
  • Session Window: There is no fixed window size; it takes three parameters: timeout, maximum duration, and partitioning key. The purpose of this window is to filter out quiet periods in the data stream.
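Stream Analytics windows are expressed in its own SQL-like query language; purely as a rough analogue, the PySpark Structured Streaming sketch below shows the same tumbling/hopping idea (the rate source and column names are illustrative assumptions, not Stream Analytics syntax).

from pyspark.sql import SparkSession
from pyspark.sql.functions import window, col

spark = SparkSession.builder.appName("windowing-analogue").getOrCreate()

# A toy streaming source with a timestamp column, standing in for real event data.
events = spark.readStream.format("rate").load().withColumnRenamed("timestamp", "eventTime")

# Tumbling: fixed, non-overlapping 10-second segments.
tumbling = events.groupBy(window(col("eventTime"), "10 seconds")).count()

# Hopping: 10-second windows that start every 5 seconds, so segments overlap.
hopping = events.groupBy(window(col("eventTime"), "10 seconds", "5 seconds")).count()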

17. Explain Azure Storage Explorer and its uses.

Azure Storage Explorer is a versatile standalone application, available for Windows, macOS, and Linux, used to manage Azure Storage from any platform. It can be downloaded from Microsoft.

It provides access to multiple Azure data stores such as ADLS Gen2, Cosmos DB, Blobs, Queues, Tables, etc., with an easy to navigate GUI.

One of the key features of Azure Storage Explorer is that it allows users to work even when they are disconnected from the Azure cloud service by attaching local emulators.

18. What is Azure Databricks, and how is it different from regular Databricks?

Azure Databricks is the Azure implementation of Apache Spark, an open-source big data processing platform. Azure Databricks is used in the data preparation and processing stage of the data lifecycle.

First, data is ingested in Azure using Data Factory and stored in permanent storage (such as ADLS Gen2 or Blob Storage).

Further, data is processed using Machine Learning (ML) in Databricks, and then extracted insights are loaded into the Analysis Services in Azure like Azure Synapse Analytics or Cosmos DB.

Finally, insights are visualized and presented to the end-users with the help of Analytical reporting tools like Power BI.
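A minimal PySpark sketch of the processing step described above, assuming hypothetical ADLS Gen2 paths and column names, and that it runs in a Databricks notebook where the SparkSession is already available as `spark`:

from pyspark.sql.functions import avg, col

# Read raw data that Data Factory landed in ADLS Gen2 (paths are hypothetical).
raw = spark.read.json("abfss://landing@mydatalake.dfs.core.windows.net/events/")

# Derive a simple insight: average score per device.
insights = raw.groupBy("deviceId").agg(avg(col("score")).alias("avg_score"))

# Persist the result for downstream analysis and reporting (Synapse, Power BI, etc.).
insights.write.mode("overwrite").format("delta").save(
    "abfss://curated@mydatalake.dfs.core.windows.net/device_scores/")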

19. How is a pipeline scheduled?

To schedule a pipeline, you can use the schedule trigger or the tumbling window trigger. A schedule trigger uses a wall-clock calendar schedule and can run pipelines at periodic intervals or in calendar-based recurring patterns.

20. What’s the significance of the Azure Cosmos DB synthetic partition key?

To distribute the data uniformly across multiple partitions, selecting a good partition key is pretty important. A Synthetic partition key can be developed when there isn’t any right column with properly distributed values.

Here are the three ways in which a synthetic partition key can be created:

  1. Concatenate Properties: Combine several property values to create a synthetic partition key.
  2. Random Suffix: A random number is added at the end of the partition key’s value.
  3. Pre-calculated Suffix: Add a pre-calculated number to the end of the partition key value to enhance read performance.
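For illustration, here is a small Python sketch that combines the first two approaches; the item properties (deviceId, date) and the suffix range are hypothetical.

import random

def synthetic_partition_key(item: dict, buckets: int = 10) -> str:
    # Concatenate properties: combine values that together distribute well...
    base = f"{item['deviceId']}-{item['date']}"
    # ...then add a random suffix to spread a hot key across sub-partitions.
    return f"{base}-{random.randint(0, buckets - 1)}"

item = {"deviceId": "sensor-42", "date": "2024-01-15", "reading": 21.7}
item["partitionKey"] = synthetic_partition_key(item)

Note that point reads then have to fan out across all possible suffix values, which is the trade-off of the random-suffix approach.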




21. Which Data Factory version needs to be used to create data flows?

Using the Data Factory V2 version is recommended when creating data flows.

22. How to pass the parameters to a pipeline run?

In Data Factory, parameters are a first-class, top-level concept. They can be defined at the pipeline level, and arguments are then passed when the pipeline run is executed on demand or by a trigger.

23. What is serverless database computing in Azure?

In a typical computing scenario, the program code resides either on the server or the client-side. But Serverless computing follows the stateless code nature, i.e. the code does not require any infrastructure.

Users have to pay for the compute resources used by the code during a short period while executing the code. It is very cost-effective, and users only need to pay for the resources used.

24. What Data security options are available in Azure SQL DB?

The data security options available in Azure SQL DB are:

  • Azure SQL Firewall Rules: Azure provides two levels of security. The first is server-level firewall rules that are stored in the SQL Master database and determine the access to the Azure database server. The second is database-level firewall rules that govern access to the individual databases.
  • Azure SQL Always Encrypted: It is designed to protect sensitive data such as credit card numbers stored in the Azure SQL database.
  • Azure SQL Transparent Data Encryption (TDE): The technology used to encrypt data stored in the Azure SQL Database. TDE encrypts and decrypts the database, its backups, and its transaction log files in real time.
  • Azure SQL Database Auditing: Azure provides auditing capabilities within the SQL Database service. It allows defining the audit policy at the database server or individual database level.

25. What are some ways to ingest data from on-premise storage to Azure?

While choosing a data transfer solution, the main factors to consider are:

  • Data Size
  • Data Transfer Frequency (One-time or Periodic)
  • Network Bandwidth

Based on the above factors, data movement solutions can be:

  • Offline transfer: It is used for one-time bulk data transfer. Microsoft can provide customers with disks or secure storage devices, or customers can ship their own disks to Microsoft. The offline transfer options are Data Box, Data Box Disk, Data Box Heavy, and Import/Export (customer’s own disks).
  • Network transfer: Over a network connection, data transfer can be performed in the following ways:
    • Graphical Interface: This is ideal while transferring a few files and when there is no need to automate the data transfer. Graphical interface options include Azure Storage Explorer and Azure Portal.
    • Programmatic Transfer: Some available scriptable data transfer tools are AzCopy, Azure PowerShell, Azure CLI. Various programming language SDKs are also available.
    • On-premises devices: A physical device called Data Box Edge and a virtual Data Box Gateway are installed at the customer’s premises, optimizing the data transfer to Azure.
    • Managed Data Factory pipeline: Azure Data Factory pipelines can move, transform and automate regular data transfers from on-prem data stores to Azure.


26. Mention some common applications of Blob storage.

Common uses of Blob Storage include:

  • Serving images or documents directly to a browser.
  • Storing files for shared access.
  • Streaming audio and video.
  • Storing data for backup and restore, disaster recovery, and archiving.
  • Storing data for analysis by an on-premises or Azure-hosted service.

27. What is the star schema?

Star Join Schema, or Star Schema, is the simplest type of data warehouse schema. It is called a star schema because its structure resembles a star.

In it, the center of the star holds one fact table, with multiple connected dimension tables around it. This schema is used for querying large data sets.

28. How would you validate data when moving it from one database to another?

The integrity of the data, and guaranteeing that no data is dropped, should be of the highest priority for a data engineer. Hiring managers ask this question to understand your thought process on how data validation would occur.

The candidate should be able to talk about appropriate validation approaches for different situations. For example, you could suggest that validation can be a simple comparison, or that it can take place after the complete data migration.

 

29. What do you mean by data pipeline?

A data pipeline is a system for transporting data from one location (the source) to another (the destination), such as a data warehouse. Data is converted and optimized along the journey, and it eventually reaches a state that can be evaluated and used to produce business insights. 

The procedures involved in aggregating, organizing, and transporting data are referred to as a data pipeline. Many of the manual tasks needed in processing and improving continuous data loads are automated by modern data pipelines.

30. What is the best way to migrate data from an on-premise database to Azure?

There are several ways to migrate an on-premises database to Azure, and the best one depends on the database size, the allowed downtime, and the available network bandwidth:

  • Azure Database Migration Service (DMS): A managed service that supports both offline and online (minimal-downtime) migrations to Azure SQL Database, Azure SQL Managed Instance, and SQL Server on Azure VMs.
  • Backup and restore: Take a native backup of the on-premises database and restore it to Azure SQL Managed Instance or to SQL Server running on an Azure VM.
  • Azure Data Factory: Build copy pipelines that use a self-hosted integration runtime to move the data from the on-premises database into Azure storage or an Azure database.



31. What is the Azure Cosmos DB synthetic partition key?

As its name implies, a synthetic partition key is a partition key that you construct yourself when no existing property in your items has values that distribute the data evenly across logical partitions.

It can be built by concatenating multiple property values into a single value, by appending a random suffix to a property value, or by appending a pre-calculated suffix, so that reads and writes are spread uniformly across the partitions.

32. How is data security implemented in ADLS Gen2?

ADLS Gen2 has a multi-layered security model. The data security layers of ADLS Gen2 are:

  • Authentication: It provides user account security with three authentication modes: Azure Active Directory (AAD), Shared Key, and Shared Access Signature (SAS).
  • Access Control: It restricts access to the individual containers or files using Roles and Access Control Lists (ACLs).
  • Network Isolation: It enables admins to enable or disable access to specific Virtual Private Networks (VPNs) or IP Addresses.
  • Data Protection: Encrypts in-transit data using HTTPS.
  • Advanced Threat Protection: It allows monitoring of unauthorized attempts to access or exploit the storage account.
  • Auditing: It is the final layer of security in which ADLS Gen2 provides comprehensive auditing features where all account management activity is logged.
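As an example of the Shared Access Signature mode, the sketch below generates a read-only, time-limited SAS for a single blob with the azure-storage-blob Python SDK; the account, container, and blob names are placeholders.

from datetime import datetime, timedelta, timezone
from azure.storage.blob import BlobSasPermissions, generate_blob_sas

sas = generate_blob_sas(
    account_name="mydatalake",                 # placeholder storage account
    container_name="raw",
    blob_name="sales/2024/01/data.csv",
    account_key="<storage-account-key>",
    permission=BlobSasPermissions(read=True),  # read-only access
    expiry=datetime.now(timezone.utc) + timedelta(hours=1),
)
url = f"https://mydatalake.blob.core.windows.net/raw/sales/2024/01/data.csv?{sas}"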

33. What are pipelines and activities in Azure?

A pipeline is a grouping of activities arranged to accomplish a task together. It allows users to manage the individual activities as a single group and provides a quick overview of the activities involved in a complex task with many steps.

ADF activities are grouped into three parts:

  • Data Movement Activities – Used to ingest data into Azure or export data from Azure to external data stores.
  • Data Transformation Activities – Related to data processing and extracting information from data.
  • Control Activities – Specify a condition or affect the progress of the pipeline.

34. How do you manually execute the Data factory pipeline?

A pipeline can run with Manual or On-demand execution.

To execute the pipeline manually or programmatically, we can use the PowerShell command:

Invoke-AzDataFactoryV2Pipeline -DataFactory $df -PipelineName "DemoPipeline" -ParameterFile .\PipelineParameters.json

Here 'DemoPipeline' is the name of the pipeline to run, and 'ParameterFile' specifies the path of a JSON file containing the source and sink paths.

The format of the JSON file passed as a parameter to the above PowerShell command is:

{
  "sourceBlobContainer": "MySourceFolder",
  "sinkBlobContainer": "MySinkFolder"
}
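The same on-demand run can also be started programmatically; for example, a minimal sketch with the azure-mgmt-datafactory Python SDK (the resource group and factory names are hypothetical):

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

adf = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
run = adf.pipelines.create_run(
    "my-rg", "my-data-factory", "DemoPipeline",
    parameters={"sourceBlobContainer": "MySourceFolder",
                "sinkBlobContainer": "MySinkFolder"},
)
print(run.run_id)   # keep the run id to poll pipeline_runs.get() for status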



35. What is the trigger execution in Azure Data Factory?

In Azure Data Factory, pipelines can be triggered or automated.

Some ways to automate or trigger the execution of Azure Data Factory Pipelines are:

  • Schedule Trigger: It invokes a pipeline execution at a fixed time or on a fixed schedule such as weekly, monthly etc.
  • Tumbling Window Trigger: It executes Azure Data Factory Pipeline at fixed periodic time intervals without overlap from a specified start time.
  • Event-Based Trigger: It executes an Azure Data Factory Pipeline based on the occurrence of some event, such as the arrival or deletion of a new file in Azure Blob Storage.


36. What are mapping Dataflows?

Microsoft provides Mapping Data Flows so that data integration does not require writing code, giving a more straightforward experience than hand-coded Data Factory activities. A mapping data flow is a visual way to design data transformation logic.

The data flow becomes an Azure Data Factory (ADF) activity and gets executed as part of an ADF pipeline.

37. What is the full form of HDFS?

Hadoop Distributed File System is the full form of HDFS.

38. Define Blocks and Block Scanner in HDFS

A block is the smallest unit into which a data file is divided; Hadoop automatically splits large files into these small blocks.

The Block Scanner verifies the list of blocks present on a DataNode.

39. What are the steps that happen when Block Scanner identifies a corrupted data block?

The following steps happen when the Block Scanner detects a corrupted data block:

1) When the Block Scanner detects a corrupted data block, the DataNode reports it to the NameNode.

2) The NameNode starts the process of creating a new replica of the corrupted block from an uncorrupted copy.

3) The replication count of the correct replicas is brought back in line with the replication factor.

40. Name two messages that NameNode gets from DataNode?

There are two messages which NameNode gets from DataNode. They are 1) Block report and 2) Heartbeat.



41. List the different XML configuration files in Hadoop.

The main XML configuration files in Hadoop are:

  • mapred-site.xml
  • core-site.xml
  • hdfs-site.xml
  • yarn-site.xml

42. Assume that you have around 1 TB of data stored in Azure Blob storage. This data is in multiple CSV files. You are asked to do a couple of transformations on this data as per business logic and needs before moving it into the staging container. How would you plan and architect the solution for this given scenario? Explain in detail.

First of all, we need to analyze the situation. If you look closely at the size of the data, you will find that it is very large.

Directly transforming such a massive amount of data could therefore be a very cumbersome and time-consuming process.

Hence we should think about a big data processing mechanism where we can leverage the advantages of parallel and distributed computing. Here we have two choices:

  1. We can use Hadoop MapReduce through the HDInsight capability for doing the transformation.
  2. We can use Spark through Azure Databricks for doing the transformation on such a huge scale of data.

Of these two, Spark on Azure Databricks is the better choice because Spark is much faster than Hadoop MapReduce due to in-memory computation. So let's choose Azure Databricks as the option.

Next, we need to create a pipeline in Azure Data Factory. The pipeline should use a Databricks notebook as an activity.

We can write all the business-related transformation logic into the Spark notebook. A notebook can be written in Python, Scala, SQL, or R.

When you execute the pipeline, it triggers the Azure Databricks notebook, and your transformation logic runs as defined in the notebook. In the notebook itself, you can write the logic to store the output in the Blob storage staging area, as shown in the sketch below.
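A hedged sketch of what that Databricks notebook logic could look like (container, path, and column names are assumptions for illustration; in Databricks the SparkSession is available as `spark`):

from pyspark.sql.functions import col, to_date

source_path  = "wasbs://raw@mystorageaccount.blob.core.windows.net/input/*.csv"
staging_path = "wasbs://staging@mystorageaccount.blob.core.windows.net/output/"

# Spark reads the ~1 TB of CSV files in parallel across the cluster.
raw = spark.read.option("header", "true").csv(source_path)

# Example business-logic transformations (the real rules come from the requirements).
transformed = (raw.dropDuplicates()
                  .withColumn("order_date", to_date(col("order_date")))
                  .filter(col("amount").cast("double") > 0))

# Write the transformed output to the staging container.
transformed.write.mode("overwrite").parquet(staging_path)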

43. What are the four V's of big data?

The four V’s of big data are:

  • Velocity
  • Variety
  • Volume
  • Veracity

44. Explain the highlights of Hadoop.

Significant highlights of Hadoop are:

  • It is an open-source framework that is freely available.
  • Hadoop is compatible with many types of hardware, and it is easy to add new hardware within a particular node.
  • Hadoop supports faster distributed processing of data.
  • It stores data in the cluster, independently of the rest of the operations.
  • Hadoop allows the creation of three replicas for each block across different nodes.

45. Explain the fundamental methods of the Reducer.

setup(): It is used for configuring parameters such as the input data size and the distributed cache.

cleanup(): This method is used to clean up temporary files.

reduce(): It is the heart of the reducer and is called once per key with the associated reduce task.
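The native Reducer with setup()/reduce()/cleanup() is a Java API; as a rough Python analogue, a Hadoop Streaming reducer performs the same per-key aggregation by reading the mapper's key-sorted output from stdin:

#!/usr/bin/env python3
import sys

current_key, total = None, 0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t", 1)
    if key != current_key:
        if current_key is not None:
            print(f"{current_key}\t{total}")  # emit the reduced value for the previous key
        current_key, total = key, 0
    total += int(value)                        # the "reduce" step: sum the values per key

if current_key is not None:
    print(f"{current_key}\t{total}")           # flush the final key (the "cleanup" step)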


46. What is the full form of COSHH?

COSHH stands for Classification and Optimization-based Scheduling for Heterogeneous Hadoop systems.

47) Explain Hadoop distributed file system

Hadoop works with scalable distributed file systems such as S3, HFTP FS, FS, and HDFS. The Hadoop Distributed File System is based on the Google File System and is designed to run on a large cluster of commodity computers.

48) Explain the fundamental responsibilities of a data engineer.

Data engineers handle the source systems of data. They also simplify complex data structures and prevent the duplication of data. In many cases, they also provide ELT and data transformation.

49) What is the full form of YARN?

The full form of YARN is Yet Another Resource Negotiator.

50) List different modes in Hadoop.

Modes in Hadoop are 1) Standalone mode 2) Pseudo distributed mode 3) Fully distributed mode.



51) How do you achieve security in Hadoop?

The following steps are used to achieve security in Hadoop (via Kerberos authentication):

1) The first step is to secure the authentication channel of the client to the server. A time-stamped ticket (TGT) is issued to the client.

2) In the second step, the client uses the received time-stamped TGT to request a service ticket from the TGS (Ticket Granting Server).

3) In the third step, the client uses the service ticket to authenticate itself to the server.

52. Assume that you are working as the Azure data engineer lead at Azurelib.com. You have been asked to review an Azure Function written by one of the team members. How would you do the code review with cost optimization in mind?

For an Azure Function, the cost is mainly driven by two factors. The first is the memory a single execution run consumes. The second is the total execution time it takes. Azure Function cost is based on these two factors.

Hence, when you review the Azure Function, check both of these factors from a cost optimization perspective.

53. You are a junior data engineer working at Azurelib.com. Assume that a couple of existing Logic Apps workflows are failing for some reason. You have been asked to investigate: where can you find the details of the failures?

Go to the Logic App resource in the Azure portal. Under the Logic App Overview section, you will see the Runs history tab. Click on it, and you can review the full run history of the Logic App.

54. What is a linked service in Azure Data Factory?

A linked service is one of the components in Azure Data Factory used to make a connection: to connect to any data source, you first have to create a linked service based on the type of that data source.

A linked service can have different parameters. For example, a SQL Server linked service will typically need the server name, username, and password, whereas for connecting to Azure Blob storage you have to provide the storage location details.

55. What is a dataset in Azure Data Factory?

A dataset is needed to read or write data to any data source using ADF. A dataset is a representation of the structure and type of data held by the data source.


56. What is the ForEach activity in the Data Factory?

In Azure Data Factory, whenever you have to do some work repetitively, you will probably use the ForEach activity. You pass an array to the ForEach activity, and the loop runs once for each item of that array.

As of now, nested ForEach activities are not allowed, which means you cannot place one ForEach activity inside another ForEach activity.

57. What is the Get Metadata activity in Azure Data Factory?

Azure Data Factory uses the Get Metadata activity to get information about files. For example, in some cases you want to know the name of a file, the size of the file, or its last modified date.

All of that kind of metadata information about a file can be retrieved using the Get Metadata activity.

58. Is it possible to connect MongoDB DB from the Azure data factory?

Yes, it is possible to connect MongoDB from the Azure data factory. You have to provide the proper connection information about the MongoDB server.

If the MongoDB server resides outside the Azure workspace, you will probably have to create a self-hosted integration runtime, through which you can connect to the MongoDB server.

59. What is Synapse SQL?

Synapse SQL is the ability to do T-SQL based analytics in the Synapse workspace. Synapse SQL has two consumption models: dedicated and serverless. For the dedicated model, use dedicated SQL pools.
A workspace can have any number of these pools. To use the serverless model, use the serverless SQL pools. Every workspace has one of these pools.

Inside Synapse Studio, you can work with SQL pools by running SQL scripts.
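For instance, the serverless SQL pool can also be queried from Python over ODBC; the sketch below assumes a hypothetical workspace name and data lake path, Azure AD interactive sign-in, and ODBC Driver 17 installed locally.

import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=myworkspace-ondemand.sql.azuresynapse.net;"   # built-in serverless endpoint
    "Database=master;"
    "Authentication=ActiveDirectoryInteractive;"
    "UID=user@contoso.com;"
)
cursor = conn.cursor()
cursor.execute(
    "SELECT TOP 10 * "
    "FROM OPENROWSET(BULK 'https://mydatalake.dfs.core.windows.net/raw/sales/*.parquet', "
    "FORMAT = 'PARQUET') AS rows"
)
for row in cursor.fetchall():
    print(row)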

60. How can you use Apache Spark through Azure Synapse Analytics?

In Azure Synapse Analytics, you can run Spark code either interactively using a notebook or by creating a job that runs the Spark code. To run Spark code you need a Spark pool, which is nothing but a cluster of nodes with Spark installed on them.

