Lesson 21 of 26

Machine Learning & Big Data

Reinforcement Learning, Transfer Learning, Fine-Tuning, LSTM, Hadoop, MapReduce, NameNode, and data analytics for the UPSSSC AGTA exam.

What is Machine Learning?

Machine Learning (ML) is a branch of Artificial Intelligence where computers learn from data and improve their performance without being explicitly programmed. Instead of writing rules manually, we feed data to an algorithm and it discovers patterns on its own.

Example: A spam filter learns from thousands of emails which ones are spam and which are not — it gets better over time as it sees more data.


Types of Machine Learning

1. Supervised Learning

In Supervised Learning, the algorithm is trained on labeled data — each input has a known correct output. The model learns the mapping from inputs to outputs.

  • Use case: Predicting crop yield from weather data, email spam detection, disease diagnosis
  • Think of it as learning with a teacher who provides correct answers
| Algorithm | Type | Use Case |
|---|---|---|
| Linear Regression | Regression (continuous output) | Predicting price, temperature, yield |
| Logistic Regression | Classification (categories) | Yes/No predictions (spam or not) |
| Decision Tree | Both | Easy to interpret, tree-like decisions |
| Random Forest | Both | Multiple decision trees combined — more accurate |
| SVM (Support Vector Machine) | Classification | Finds best boundary between classes |
| KNN (K-Nearest Neighbors) | Classification | Classifies based on closest data points |
| Naive Bayes | Classification | Based on probability — fast, good for text |
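To make one of these algorithms concrete, here is a from-scratch KNN classifier in Python. The points, labels, and value of k are toy examples invented for illustration, not part of the lesson:

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest labeled points."""
    # Sort training points by squared Euclidean distance to the query
    by_dist = sorted(train, key=lambda p: (p[0][0] - query[0]) ** 2
                                          + (p[0][1] - query[1]) ** 2)
    votes = Counter(label for _, label in by_dist[:k])
    return votes.most_common(1)[0][0]

# Toy labeled data: two well-separated groups (illustrative values only)
train = [((1, 1), "spam"), ((1, 2), "spam"), ((2, 1), "spam"),
         ((8, 8), "ham"), ((8, 9), "ham"), ((9, 8), "ham")]
print(knn_predict(train, (2, 2)))  # all 3 nearest neighbors are "spam"
```

In practice one would use a library implementation such as scikit-learn's `KNeighborsClassifier`, but the distance-plus-voting logic is the same.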

2. Unsupervised Learning

In Unsupervised Learning, the algorithm works with unlabeled data — there are no correct answers provided. The model finds hidden patterns and groupings on its own.

  • Use case: Customer segmentation, anomaly detection, market basket analysis
| Algorithm | What It Does |
|---|---|
| K-Means Clustering | Groups data into K clusters based on similarity |
| Hierarchical Clustering | Creates a tree of clusters (dendrogram) |
| PCA (Principal Component Analysis) | Reduces data dimensions while keeping important information |
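As a sketch of how K-Means groups data, the following pure-Python version alternates its two steps: assign each point to the nearest centroid, then recompute each centroid as its cluster's mean. The 1-D points and starting centroids are made up for illustration:

```python
def kmeans_1d(points, centroids, iters=10):
    """Basic 1-D K-Means: alternate assignment and centroid-update steps."""
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid's cluster
        clusters = {c: [] for c in centroids}
        for p in points:
            nearest = min(centroids, key=lambda c: abs(c - p))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster
        centroids = [sum(v) / len(v) if v else c for c, v in clusters.items()]
    return sorted(centroids)

# Illustrative data: two obvious groups, around 2 and around 10
print(kmeans_1d([1, 2, 3, 9, 10, 11], centroids=[0.0, 5.0]))  # converges to [2.0, 10.0]
```

Real data is multi-dimensional and K must be chosen up front, but the assign/update loop is exactly this.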

3. Reinforcement Learning

Reinforcement Learning (RL) is where an agent learns by interacting with an environment, taking actions, and receiving rewards or penalties. The goal is to maximize total reward over time.

| Component | Role |
|---|---|
| Agent | The learner/decision maker |
| Environment | The world the agent interacts with |
| Action | What the agent does |
| Reward | Feedback — positive (good action) or negative (bad action) |
| Policy | Strategy the agent follows to decide actions |
  • Famous example: AlphaGo by Google DeepMind — defeated the world champion in the board game Go
  • Used in: robotics, game playing, self-driving cars, recommendation systems
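The agent/environment/action/reward/policy loop can be sketched with tabular Q-learning, one standard RL algorithm. The toy corridor environment and all hyperparameters below are invented for illustration:

```python
import random

# Toy 1-D corridor: states 0..4, reward only on reaching state 4.
def train(episodes=500, alpha=0.5, gamma=0.9, eps=0.2, seed=0):
    random.seed(seed)
    actions = (-1, 1)  # step left / step right
    q = {(s, a): 0.0 for s in range(5) for a in actions}
    for _ in range(episodes):
        s = 0
        while s != 4:
            # Epsilon-greedy policy: mostly exploit, sometimes explore
            if random.random() < eps:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a: q[(s, a)])
            s2 = min(4, max(0, s + a))   # environment transition
            r = 1.0 if s2 == 4 else 0.0  # reward signal
            # Core update: nudge Q toward reward + discounted best future value
            q[(s, a)] += alpha * (r + gamma * max(q[(s2, b)] for b in actions)
                                  - q[(s, a)])
            s = s2
    return q

q = train()
policy = [max((-1, 1), key=lambda a: q[(s, a)]) for s in range(4)]
print(policy)  # the learned policy steps right, toward the reward
```

No labels are given anywhere: the agent discovers the "go right" strategy purely from the reward signal, which is the defining trait of RL.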

Deep Learning

Deep Learning is a subset of Machine Learning that uses neural networks with multiple layers (hence “deep”) to learn complex patterns from large amounts of data.

| Network Type | Full Form | Best For | Example |
|---|---|---|---|
| CNN | Convolutional Neural Network | Image recognition, object detection | Identifying plant diseases from leaf photos |
| RNN | Recurrent Neural Network | Sequential data (text, time series) | Language translation, speech recognition |
| LSTM | Long Short-Term Memory | Long sequences with memory | Weather forecasting, stock prediction |

LSTM — Special Focus

LSTM (Long Short-Term Memory) is a special type of RNN designed to remember information for long periods. Regular RNNs struggle with long sequences (they “forget” early inputs), but LSTM solves this with a memory cell that can store, update, or discard information.

  • Has three gates: Forget gate (what to discard), Input gate (what to store), Output gate (what to output)
  • Used in: language models, weather prediction, crop yield forecasting from time-series data
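The three-gate mechanics can be sketched with scalar arithmetic for a single time step. The weights below are arbitrary toy values (a real LSTM learns weight matrices during training, and states are vectors, not scalars):

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    """One toy scalar LSTM step showing the three gates in action."""
    f = sigmoid(w["f"] * x + w["uf"] * h_prev)          # forget gate: what to discard
    i = sigmoid(w["i"] * x + w["ui"] * h_prev)          # input gate: what to store
    c_tilde = math.tanh(w["c"] * x + w["uc"] * h_prev)  # candidate memory
    c = f * c_prev + i * c_tilde                        # update the memory cell
    o = sigmoid(w["o"] * x + w["uo"] * h_prev)          # output gate: what to expose
    h = o * math.tanh(c)                                # new hidden state
    return h, c

w = {k: 0.5 for k in ("f", "uf", "i", "ui", "c", "uc", "o", "uo")}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=0.0, w=w)
print(round(h, 3), round(c, 3))
```

The key line is `c = f * c_prev + i * c_tilde`: because the old cell state is carried forward (scaled, not squashed through the whole network), information can survive across many time steps, which is what plain RNNs fail to do.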

Transfer Learning & Fine-Tuning

Transfer Learning

Transfer Learning uses a model that was pre-trained on a large dataset and applies its learned knowledge to a new, related task. Instead of training from scratch (which needs massive data and computing power), you reuse an existing model.

  • Example: A model trained on millions of general images (ImageNet) can be adapted to identify crop diseases with just a few hundred crop images
  • Saves time, computing resources, and requires less training data

Fine-Tuning

Fine-Tuning is the process of taking a pre-trained model and training it further on your specific dataset. You adjust the model’s weights slightly to perform well on the new task.

  • Transfer Learning = use the model as-is or with minor changes
  • Fine-Tuning = retrain some or all layers on your specific data

Important ML Terminology

| Term | Meaning |
|---|---|
| Feature | An input variable (e.g., rainfall, soil pH, temperature) |
| Label | The output/target variable (e.g., crop yield, disease type) |
| Training Data | Data used to teach the model |
| Testing Data | Unseen data used to evaluate model performance |
| Validation Data | Data used to tune model parameters during training |
| Epoch | One complete pass through the entire training dataset |
| Batch Size | Number of samples processed before model updates its weights |
| Learning Rate | How quickly the model adjusts — too high = overshoots, too low = too slow |
| Overfitting | Model memorizes training data but fails on new data (too complex) |
| Underfitting | Model is too simple to capture patterns in data (poor performance on everything) |

What is Big Data?

Big Data refers to extremely large and complex datasets that cannot be processed by traditional database tools. It requires specialized technologies and frameworks.

The 5 V’s of Big Data

| V | Meaning | Example |
|---|---|---|
| Volume | Huge amount of data | Terabytes/Petabytes of social media posts |
| Velocity | Speed of data generation | Real-time stock market data, sensor readings |
| Variety | Different types of data | Text, images, videos, sensor data mixed together |
| Veracity | Accuracy and trustworthiness | Is the data reliable or noisy? |
| Value | Usefulness of data | Can we extract meaningful insights from it? |

Types of Data

| Type | Description | Examples |
|---|---|---|
| Structured | Organized in rows/columns (tabular) | Excel sheets, SQL databases, CSV files |
| Unstructured | No predefined format | Images, videos, emails, social media posts |
| Semi-structured | Partially organized | XML, JSON, HTML files |

Hadoop Ecosystem

Apache Hadoop is an open-source framework for distributed storage and processing of large datasets across clusters of computers.

Hadoop Core Components

| Component | Full Form | Role |
|---|---|---|
| HDFS | Hadoop Distributed File System | Distributed storage — stores data across multiple machines |
| MapReduce | — | Processing framework — processes data in parallel |
| YARN | Yet Another Resource Negotiator | Resource management — allocates CPU, memory to tasks |

HDFS Architecture

| Component | Role |
|---|---|
| NameNode | Master node — stores metadata (file names, locations, block info). Does NOT store actual data |
| DataNode | Slave nodes — store actual data blocks. Report to NameNode periodically |
| Secondary NameNode | Creates checkpoints of NameNode metadata (NOT a backup NameNode) |
  • Data is split into blocks (default 128 MB) and distributed across DataNodes
  • Each block is replicated 3 times (default) for fault tolerance
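The arithmetic behind these two defaults can be sketched in a few lines; the 1 GB file size is an invented example (and note the last block of a file may be smaller than 128 MB):

```python
import math

def hdfs_blocks(file_mb, block_mb=128, replication=3):
    """How many HDFS blocks a file is split into, and how many
    block copies are stored across DataNodes with replication."""
    blocks = math.ceil(file_mb / block_mb)
    return blocks, blocks * replication

# A 1 GB (1024 MB) file with the default settings:
blocks, stored = hdfs_blocks(1024)
print(blocks, stored)  # 8 blocks, 24 block copies across DataNodes
```

The 3x replication is why losing a single DataNode does not lose data: two other copies of each of its blocks exist elsewhere in the cluster.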

MapReduce — How It Works

MapReduce processes data in two main phases, Map and Reduce, with an intermediate Shuffle & Sort step between them:

| Phase | What Happens | Example (Word Count) |
|---|---|---|
| Map | Splits input data and processes each piece independently | Each mapper counts words in its chunk |
| Shuffle & Sort | Groups intermediate results by key | All counts for same word grouped together |
| Reduce | Aggregates grouped results to produce final output | Sums up counts for each word |
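The word-count flow in the table can be simulated on a single machine in a few lines of Python; the input chunks are invented for illustration (in a real cluster, each chunk would live on a different node):

```python
from collections import defaultdict

chunks = ["big data big", "data big deal"]  # illustrative input split into chunks

# Map: each "mapper" emits (word, 1) pairs for its own chunk, independently
mapped = [(word, 1) for chunk in chunks for word in chunk.split()]

# Shuffle & Sort: group the intermediate pairs by key (the word)
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce: aggregate the grouped counts into the final output
result = {word: sum(counts) for word, counts in grouped.items()}
print(result)  # {'big': 3, 'data': 2, 'deal': 1}
```

Because each mapper touches only its own chunk and each reducer only its own keys, both phases parallelize across machines with no coordination beyond the shuffle.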

Apache Spark

Apache Spark is a faster alternative to MapReduce that processes data in-memory (RAM) instead of reading/writing to disk repeatedly.

  • Up to 100x faster than MapReduce for in-memory processing
  • Supports batch processing, streaming, machine learning (MLlib), and graph processing
  • Written in Scala, supports Java, Python, R

Data Analytics Types

| Type | Question It Answers | Example |
|---|---|---|
| Descriptive | What happened? | Last year’s crop production report |
| Diagnostic | Why did it happen? | Why did yield drop in Kharif season? |
| Predictive | What will happen? | Forecasting next season’s rainfall |
| Prescriptive | What should we do? | Recommending optimal fertilizer quantity |

Data Storage & Mining

| Concept | Description |
|---|---|
| Data Warehouse | Central repository of structured, historical data for analysis and reporting |
| Data Lake | Storage for raw data in any format (structured + unstructured) — process later |
| Data Mining | Discovering patterns and insights from large datasets using statistical methods |
| Visualization Tools | Tableau, Power BI — create charts, dashboards, interactive reports |

Common ML Algorithms — Exam Quick Reference

| Algorithm | Type | How It Works | Best For |
|---|---|---|---|
| KNN (K-Nearest Neighbors) | Classification | Classifies a data point based on the majority class of its K closest neighbors | Pattern recognition, recommendation systems |
| Naive Bayes | Classification | Uses Bayes’ theorem — calculates probability of each class, assumes features are independent | Spam filtering, text classification, sentiment analysis |
| SVM (Support Vector Machine) | Classification | Finds the optimal hyperplane (boundary) that best separates two classes with maximum margin | Image classification, face detection |
| Decision Tree | Both | Tree-like model — splits data at each node based on a feature condition (if-then rules) | Easy to interpret, medical diagnosis |
| Random Forest | Both | Ensemble method — builds many decision trees and takes majority vote | More accurate than single tree, reduces overfitting |

Exam tip: Random Forest = many Decision Trees voting together (like asking 100 experts instead of 1).


Neural Network Activation Functions

Activation functions determine whether a neuron should be “activated” (fire) or not. They introduce non-linearity into the network.

| Function | Output Range | Use |
|---|---|---|
| Sigmoid | 0 to 1 | Binary classification (yes/no) — output layer |
| ReLU (Rectified Linear Unit) | 0 to ∞ | Most popular for hidden layers — fast, avoids vanishing gradient |
| Softmax | 0 to 1 (probabilities summing to 1) | Multi-class classification output — e.g., classifying crop disease among 10 types |
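All three functions are simple enough to define directly; a sketch in plain Python (the input scores passed to softmax are arbitrary example values):

```python
import math

def sigmoid(x):
    """Squashes any real number into (0, 1) — used for binary outputs."""
    return 1 / (1 + math.exp(-x))

def relu(x):
    """Passes positives through unchanged, zeroes out negatives."""
    return max(0.0, x)

def softmax(scores):
    """Turns a score vector into probabilities summing to 1."""
    exps = [math.exp(s - max(scores)) for s in scores]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

print(round(sigmoid(0), 2))      # 0.5
print(relu(-3), relu(2))         # 0.0 2
print(softmax([2.0, 1.0, 0.1]))  # largest score gets the largest probability
```

Note that `relu` is not differentiable at 0 and `softmax` subtracts the maximum score before exponentiating, a standard trick to avoid overflow on large inputs.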

Important Training Terms

| Term | Meaning |
|---|---|
| Epoch | One complete pass through the entire training dataset |
| Batch Size | Number of training samples used in one update of model weights |
| Iteration | Number of batches needed to complete one epoch |
| Learning Rate | Controls how much weights are adjusted per update |

Example: 1,000 samples, batch size = 100 → 10 iterations per epoch.
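The same arithmetic in code, using the lesson's numbers (the epoch count of 5 is an added illustrative value):

```python
# How epoch, batch size, and iteration relate
samples, batch_size, epochs = 1000, 100, 5
iterations_per_epoch = samples // batch_size  # weight updates per full pass
total_updates = iterations_per_epoch * epochs
print(iterations_per_epoch, total_updates)  # 10 iterations per epoch, 50 updates total
```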


Model Evaluation Metrics

| Metric | What It Measures |
|---|---|
| Accuracy | Overall correct predictions / total predictions |
| Precision | Of all predicted positives, how many are actually positive? (avoids false alarms) |
| Recall (Sensitivity) | Of all actual positives, how many were correctly identified? (avoids missing cases) |
| F1 Score | Harmonic mean of Precision and Recall — balances both |

Exam tip: Precision = “Don’t cry wolf”; Recall = “Don’t miss any wolf”; F1 = balance of both.
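All four metrics follow directly from the confusion-matrix counts (true/false positives and negatives); the counts below are invented for illustration:

```python
def metrics(tp, fp, fn, tn):
    """Compute evaluation metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)  # of predicted positives, how many were right
    recall = tp / (tp + fn)     # of actual positives, how many were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return accuracy, precision, recall, f1

# Illustrative counts: 80 true pos, 10 false pos, 20 false neg, 90 true neg
acc, p, r, f1 = metrics(tp=80, fp=10, fn=20, tn=90)
print(round(p, 3), round(r, 3), round(f1, 3))
```

The harmonic mean punishes imbalance: a model with precision 1.0 but recall 0.1 gets an F1 of about 0.18, not the 0.55 an ordinary average would suggest.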


Big Data Tools — Additional

| Tool/Concept | What It Does |
|---|---|
| Apache Kafka | Real-time data streaming platform — handles millions of events per second (used by LinkedIn, Uber) |
| ETL (Extract, Transform, Load) | Process for moving data from source systems into a Data Warehouse: Extract raw data → Transform (clean, format) → Load into warehouse |

Data Lake vs Data Warehouse

| Feature | Data Lake | Data Warehouse |
|---|---|---|
| Data Type | Raw, unprocessed data (any format) | Processed, structured data |
| Schema | Schema-on-read (structure applied when reading) | Schema-on-write (structure defined before loading) |
| Users | Data scientists, ML engineers | Business analysts, managers |
| Cost | Lower (stores everything cheaply) | Higher (requires processing before storage) |
| Example | AWS S3, Azure Data Lake | Amazon Redshift, Google BigQuery |

Exam tip: Data Lake = dump everything raw; Data Warehouse = organized, ready-to-query data.


Key Takeaways

  • Supervised Learning uses labeled data: KNN (nearest neighbors), Naive Bayes (probability), SVM (hyperplane boundary), Decision Tree (if-then rules), Random Forest (ensemble of trees)
  • Unsupervised Learning finds patterns in unlabeled data (K-Means clustering, PCA dimensionality reduction)
  • Reinforcement Learning: agent learns by rewards/penalties (AlphaGo by DeepMind)
  • CNN = images, RNN = sequences, LSTM = long sequences with memory (3 gates: forget, input, output)
  • Activation functions: Sigmoid (0-1, binary), ReLU (0-infinity, hidden layers), Softmax (probabilities, multi-class)
  • Training terms: Epoch = one full pass through data; Batch Size = samples per weight update; Learning Rate = adjustment speed
  • Model evaluation: Precision (avoid false alarms), Recall (don’t miss cases), F1 Score (harmonic mean of both)
  • Transfer Learning reuses pre-trained models; Fine-Tuning retrains them on specific data
  • Overfitting = too complex (memorizes); Underfitting = too simple (misses patterns)
  • Big Data 5 V’s: Volume, Velocity, Variety, Veracity, Value
  • Hadoop: HDFS (storage, 128 MB blocks, 3x replication) + MapReduce (processing) + YARN (resource management)
  • NameNode = master (metadata only); DataNode = slave (actual data blocks)
  • Spark is up to 100x faster than MapReduce — processes data in-memory (RAM)
  • Apache Kafka = real-time data streaming (millions of events/sec)
  • ETL = Extract, Transform, Load — process for moving data into Data Warehouse
  • Data Lake = raw data, any format, schema-on-read; Data Warehouse = processed structured data, schema-on-write
  • Visualization tools: Tableau, Power BI; Analytics: Descriptive/Diagnostic/Predictive/Prescriptive

Summary Cheat Sheet

| Concept | Key Details |
|---|---|
| Machine Learning | Computers learn from data without explicit programming |
| Supervised Learning | Labeled data — regression & classification |
| Linear Regression | Predicts continuous values (price, yield) |
| Decision Tree | Tree-like decisions, easy to interpret |
| Random Forest | Multiple decision trees combined |
| SVM | Finds best boundary between classes |
| KNN | Classifies based on nearest neighbors |
| Naive Bayes | Probability-based, fast, good for text |
| Unsupervised Learning | Unlabeled data — clustering & dimensionality reduction |
| K-Means | Groups data into K clusters |
| PCA | Reduces dimensions, keeps key info |
| Reinforcement Learning | Agent + Environment + Reward + Policy |
| AlphaGo | RL example by Google DeepMind |
| CNN | Images — Convolutional Neural Network |
| RNN | Sequences — Recurrent Neural Network |
| LSTM | Long sequences with memory cells (3 gates) |
| Transfer Learning | Reuse pre-trained model for new task |
| Fine-Tuning | Retrain pre-trained model on specific data |
| Overfitting | Model too complex, fails on new data |
| Underfitting | Model too simple, poor on all data |
| Big Data 5 V’s | Volume, Velocity, Variety, Veracity, Value |
| Structured Data | Rows/columns (Excel, SQL) |
| Unstructured Data | No format (images, videos, emails) |
| HDFS | Hadoop distributed storage system |
| NameNode | Master — stores metadata, not data |
| DataNode | Slave — stores actual data blocks |
| MapReduce | Map (split & process) → Reduce (aggregate) |
| Apache Spark | In-memory processing, up to 100x faster than MapReduce |
| YARN | Resource manager in Hadoop |
| Data Warehouse | Structured historical data for analysis — schema-on-write (Redshift, BigQuery) |
| Data Lake | Raw data in any format — schema-on-read (AWS S3) |
| Data Mining | Discover patterns from large datasets |
| Tableau / Power BI | Data visualization tools |
| Descriptive Analytics | What happened? |
| Diagnostic Analytics | Why did it happen? |
| Predictive Analytics | What will happen? |
| Prescriptive Analytics | What should we do? |
| Epoch | One complete pass through training dataset |
| Batch Size | Samples processed before model weight update |
| Learning Rate | How much weights adjust per update |
| Precision | Of predicted positives, how many are correct? |
| Recall | Of actual positives, how many were found? |
| F1 Score | Harmonic mean of Precision and Recall |
| Sigmoid | Activation 0-1, binary classification output |
| ReLU | Activation 0-infinity, most popular for hidden layers |
| Softmax | Activation probabilities summing to 1, multi-class output |
| Apache Kafka | Real-time data streaming — millions of events/sec |
| ETL | Extract, Transform, Load — data into warehouse |
