Course Outline
Day 01
Overview of Big Data Business Intelligence for Criminal Intelligence Analysis
- Case Studies from Law Enforcement - Predictive Policing.
- Big Data adoption rates in Law Enforcement Agencies and their alignment with Big Data Predictive Analytics for future operations.
- Emerging technology solutions such as gunshot sensors, surveillance video, and social media.
- Leveraging Big Data technology to mitigate information overload.
- Integrating Big Data with Legacy data.
- Foundational understanding of enabling technologies in predictive analytics.
- Data Integration & Dashboard visualization.
- Fraud management.
- Business Rules and Fraud detection.
- Threat detection and profiling.
- Cost-benefit analysis for Big Data implementation.
Introduction to Big Data
- Key characteristics of Big Data: Volume, Variety, Velocity, and Veracity.
- MPP (Massively Parallel Processing) architecture.
- Data Warehouses – static schema, slowly evolving datasets.
- MPP Databases: Greenplum, Exadata, Teradata, Netezza, Vertica, etc.
- Hadoop-Based Solutions – no structural constraints on datasets.
- Typical pattern: HDFS, MapReduce (crunch), retrieval from HDFS.
- Apache Spark for stream processing.
- Batch processing – suited for analytical/non-interactive tasks.
- Velocity: CEP (Complex Event Processing) for streaming data.
- Typical choices – CEP products (e.g., InfoSphere Streams, Apama, MarkLogic, etc.).
- Less production-ready options – Storm/S4.
- NoSQL Databases – (columnar and key-value): Best suited as analytical adjuncts to data warehouses/databases.
NoSQL Solutions
- KV Store - Keyspace, Flare, SchemaFree, RAMCloud, Oracle NoSQL Database (OnDB).
- KV Store - Dynamo, Voldemort, Dynomite, SubRecord, MotionDb, DovetailDB.
- KV Store (Hierarchical) - GT.M, Caché.
- KV Store (Ordered) - TokyoTyrant, Lightcloud, NMDB, Luxio, MemcacheDB, Actord.
- KV Cache - Memcached, Repcached, Coherence, Infinispan, eXtreme Scale, JBossCache, Velocity, Terracotta.
- Tuple Store - Gigaspaces, Coord, Apache River.
- Object Database - ZopeDB, db4o, Shoal.
- Document Store - CouchDB, Cloudant, Couchbase, MongoDB, Jackrabbit, XML-Databases, ThruDB, CloudKit, Persevere, Riak-Basho, Scalaris.
- Wide Columnar Store - BigTable, HBase, Apache Cassandra, Hypertable, KAI, OpenNeptune, Qbase, KDI.
Varieties of Data: Introduction to Data Cleaning Issues in Big Data
- RDBMS – static structure/schema; does not promote an agile, exploratory environment.
- NoSQL – semi-structured; provides enough structure to store data without an exact schema prior to storage.
- Data cleaning issues.
Hadoop
- When to select Hadoop?
- STRUCTURED - Enterprise data warehouses/databases can store massive data (at a cost) but impose structure (not ideal for active exploration).
- SEMI STRUCTURED data – difficult to handle using traditional solutions (DW/DB).
- Warehousing data = HUGE effort and static even after implementation.
- For variety & volume of data, crunched on commodity hardware – HADOOP.
- Commodity H/W needed to create a Hadoop Cluster.
Introduction to MapReduce/HDFS
- MapReduce – distribute computing over multiple servers.
- HDFS – make data available locally for the computing process (with redundancy).
- Data – can be unstructured/schema-less (unlike RDBMS).
- Developer responsibility to make sense of data.
- Programming MapReduce = working with Java (pros/cons), manually loading data into HDFS.
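The map, shuffle, and reduce steps named above can be illustrated with a minimal single-process sketch. It is written in Python rather than Java for brevity and runs on one machine only; on a real Hadoop cluster the framework distributes the map and reduce calls across nodes and reads the input from HDFS. The function names and the sample records are hypothetical and invented for illustration.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def run_job(records):
    # On a cluster, each record (or block of records) would be mapped
    # on a different node; here everything runs in one process.
    mapped = chain.from_iterable(map_phase(r) for r in records)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


# Hypothetical log lines, invented for illustration only.
sample_records = [
    "burglary reported downtown",
    "vehicle theft reported downtown",
    "burglary suspect detained",
]
word_counts = run_job(sample_records)
```

The developer supplies only the map and reduce logic; making sense of the schema-less input (here, splitting free text into words) remains the developer's responsibility.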
Day 02
Big Data Ecosystem -- Building Big Data ETL (Extract, Transform, Load) -- Which Big Data Tools to use and when?
- Hadoop vs. Other NoSQL solutions.
- For interactive, random access to data.
- HBase (column-oriented database) on top of Hadoop.
- Random access to data but restrictions imposed (max 1 PB).
- Not ideal for ad-hoc analytics; good for logging, counting, time-series.
- Sqoop - Import from databases to Hive or HDFS (JDBC/ODBC access).
- Flume – Stream data (e.g., log data) into HDFS.
Big Data Management System
- Moving parts, compute nodes start/fail: ZooKeeper - For configuration/coordination/naming services.
- Complex pipeline/workflow: Oozie – manage workflow, dependencies, daisy chain.
- Deploy, configure, cluster management, upgrade etc (sys admin): Ambari.
- In Cloud: Whirr.
Predictive Analytics -- Fundamental Techniques and Machine Learning based Business Intelligence
- Introduction to Machine Learning.
- Learning classification techniques.
- Bayesian Prediction – preparing a training file.
- Support Vector Machine.
- KNN p-Tree Algebra & vertical mining.
- Neural Networks.
- Big Data large variable problem – Random Forest (RF).
- Big Data Automation problem – Multi-model ensemble RF.
- Automation through Soft10-M.
- Text analytic tool – Treeminer.
- Agile learning.
- Agent-based learning.
- Distributed learning.
- Introduction to Open Source Tools for predictive analytics: R, Python, RapidMiner, Mahout.
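As a rough illustration of the "Bayesian Prediction – preparing a training file" topic, the sketch below trains a categorical Naive Bayes classifier from a small CSV training file. Naive Bayes with add-one (Laplace) smoothing is an assumed choice here; the outline does not specify which Bayesian method the course uses. The column names, labels, and rows are hypothetical and invented for illustration.

```python
import csv
import io
import math
from collections import Counter, defaultdict

# Hypothetical training file: one labelled example per row (invented data).
TRAINING_CSV = """time_of_day,location,label
night,commercial,burglary
night,commercial,burglary
night,residential,burglary
day,commercial,fraud
day,commercial,fraud
day,residential,fraud
"""


def load_training(text):
    """Parse CSV text into (features, label) pairs."""
    rows = csv.DictReader(io.StringIO(text))
    return [
        ({k: v for k, v in row.items() if k != "label"}, row["label"])
        for row in rows
    ]


def train(examples):
    """Count labels and per-label feature values."""
    label_counts = Counter(label for _, label in examples)
    feature_counts = defaultdict(Counter)  # (label, feature) -> value counts
    feature_values = defaultdict(set)      # feature -> distinct values seen
    for features, label in examples:
        for name, value in features.items():
            feature_counts[(label, name)][value] += 1
            feature_values[name].add(value)
    return label_counts, feature_counts, feature_values


def predict(model, features):
    """Return the label with the highest log-posterior score."""
    label_counts, feature_counts, feature_values = model
    total = sum(label_counts.values())
    scores = {}
    for label, count in label_counts.items():
        score = math.log(count / total)  # log prior
        for name, value in features.items():
            seen = feature_counts[(label, name)][value]
            # Add-one smoothing so unseen values do not zero out the score.
            score += math.log((seen + 1) / (count + len(feature_values[name])))
        scores[label] = score
    return max(scores, key=scores.get)


model = train(load_training(TRAINING_CSV))
night_residential = predict(model, {"time_of_day": "night", "location": "residential"})
day_commercial = predict(model, {"time_of_day": "day", "location": "commercial"})
```

Most of the practical effort sits in `load_training`: the training file must have consistent columns and a reliable label for every row before any model can be fitted.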
Predictive Analytics Ecosystem and its application in Criminal Intelligence Analysis
- Technology and the investigative process.
- Insight analytics.
- Visualization analytics.
- Structured predictive analytics.
- Unstructured predictive analytics.
- Threat/fraudster/vendor profiling.
- Recommendation Engine.
- Pattern detection.
- Rule/Scenario discovery – failure, fraud, optimization.
- Root cause discovery.
- Sentiment analysis.
- CRM analytics.
- Network analytics.
- Text analytics for obtaining insights from transcripts, witness statements, internet chatter, etc.
- Technology-assisted review.
- Fraud analytics.
- Real-time analytics.
Day 03
Real Time and Scalable Analytics Over Hadoop
- Why common analytic algorithms fail in Hadoop/HDFS.
- Apache Hama – for Bulk Synchronous distributed computing.
- Apache Spark – for cluster computing and real-time analytics.
- CMU GraphLab 2 – Graph-based asynchronous approach to distributed computing.
- KNN p-Tree – Algebra-based approach from Treeminer for reduced hardware cost of operation.
Tools for eDiscovery and Forensics
- eDiscovery over Big Data vs. Legacy data – a comparison of cost and performance.
- Predictive coding and Technology Assisted Review (TAR).
- Live demo of vMiner for understanding how TAR enables faster discovery.
- Faster indexing through HDFS – Velocity of data.
- NLP (Natural Language processing) – open source products and techniques.
- eDiscovery in foreign languages – technology for foreign language processing.
Big Data BI for Cyber Security – Getting a 360-degree view, speedy data collection and threat identification
- Understanding the basics of security analytics – attack surface, security misconfiguration, host defenses.
- Network infrastructure / Large datapipe / Response ETL for real-time analytics.
- Prescriptive vs predictive – Fixed rule-based vs auto-discovery of threat rules from metadata.
Gathering disparate data for Criminal Intelligence Analysis
- Using IoT (Internet of Things) as sensors for capturing data.
- Using Satellite Imagery for Domestic Surveillance.
- Using surveillance and image data for criminal identification.
- Other data gathering technologies – drones, body cameras, GPS tagging systems, and thermal imaging technology.
- Combining automated data retrieval with data obtained from informants, interrogation, and research.
- Forecasting criminal activity.
Day 04
Fraud Prevention BI from Big Data in Fraud Analytics
- Basic classification of Fraud Analytics – rules-based vs predictive analytics.
- Supervised vs unsupervised Machine learning for Fraud pattern detection.
- Business-to-business fraud, medical claims fraud, insurance fraud, tax evasion, and money laundering.
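The contrast between rules-based and data-driven fraud detection can be sketched as follows. The fixed threshold stands in for a business rule, and a simple z-score outlier test stands in for unsupervised pattern detection; both are assumed simplifications, not the methods taught in the course. The claim records, threshold, and cutoff are hypothetical and invented for illustration.

```python
from statistics import mean, stdev

# Hypothetical claim amounts, invented for illustration only.
claims = [
    {"id": "C1", "amount": 100.0},
    {"id": "C2", "amount": 110.0},
    {"id": "C3", "amount": 90.0},
    {"id": "C4", "amount": 105.0},
    {"id": "C5", "amount": 95.0},
    {"id": "C6", "amount": 100.0},
    {"id": "C7", "amount": 100.0},
    {"id": "C8", "amount": 5000.0},
]

RULE_THRESHOLD = 10_000.0  # assumed fixed business rule


def rule_based_flags(records, threshold=RULE_THRESHOLD):
    """Rules-based: flag any claim above a fixed, hand-set amount."""
    return [r["id"] for r in records if r["amount"] > threshold]


def zscore_flags(records, cutoff=2.0):
    """Unsupervised: flag claims far from the mean of their peers."""
    amounts = [r["amount"] for r in records]
    mu, sigma = mean(amounts), stdev(amounts)
    if sigma == 0:
        return []  # all amounts identical, nothing stands out
    return [r["id"] for r in records if abs(r["amount"] - mu) / sigma > cutoff]


flagged_by_rule = rule_based_flags(claims)
flagged_by_outlier_test = zscore_flags(claims)
```

In this invented data set the fixed rule flags nothing, while the outlier test flags the one claim that is unusual relative to the rest. A supervised approach would instead need claims already labelled as fraudulent or legitimate.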
Social Media Analytics – Intelligence gathering and analysis
- How criminals use Social Media to organize, recruit, and plan.
- Big Data ETL API for extracting social media data.
- Text, image, metadata, and video.
- Sentiment analysis from social media feeds.
- Contextual and non-contextual filtering of social media feeds.
- Social Media Dashboard to integrate diverse social media.
- Automated profiling of social media profiles.
- Live demo of each analytic will be given through the Treeminer Tool.
Big Data Analytics in image processing and video feeds
- Image Storage techniques in Big Data – Storage solutions for data exceeding petabytes.
- LTFS (Linear Tape File System) and LTO (Linear Tape Open).
- GPFS-LTFS (General Parallel File System - Linear Tape File System) – layered storage solution for Big image data.
- Fundamentals of image analytics.
- Object recognition.
- Image segmentation.
- Motion tracking.
- 3-D image reconstruction.
Biometrics, DNA and Next Generation Identification Programs
- Beyond fingerprinting and facial recognition.
- Speech recognition, keystroke (analyzing a user's typing pattern), and CODIS (Combined DNA Index System).
- Beyond DNA matching: using forensic DNA phenotyping to construct a face from DNA samples.
Big Data Dashboard for quick access to and display of diverse data:
- Integration of existing application platforms with Big Data Dashboards.
- Big Data management.
- Case Study of Big Data Dashboards: Tableau and Pentaho.
- Use Big Data apps to push location-based services in Government.
- Tracking system and management.
Day 05
How to justify Big Data BI implementation within an organization:
- Defining the ROI (Return on Investment) for implementing Big Data.
- Case studies for saving Analyst Time in data collection and preparation – increasing productivity.
- Cost savings from lower database licensing costs.
- Revenue gain from location-based services.
- Cost savings from fraud prevention.
- An integrated spreadsheet approach for calculating approximate expenses vs. Revenue gain/savings from Big Data implementation.
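The expenses-versus-gains calculation can be sketched in a few lines, assuming the simplest definition of ROI: (total gains minus total costs) divided by total costs. The cost and gain categories and all figures below are hypothetical placeholders; a real spreadsheet would use the organization's own estimates.

```python
def simple_roi(costs, gains):
    """Return ROI as a fraction: (total gains - total costs) / total costs."""
    total_cost = sum(costs.values())
    total_gain = sum(gains.values())
    if total_cost == 0:
        raise ValueError("total cost must be non-zero")
    return (total_gain - total_cost) / total_cost


# Hypothetical annual figures, invented for illustration only.
example_costs = {
    "hardware": 200_000,
    "software_and_support": 50_000,
    "training": 50_000,
}
example_gains = {
    "analyst_time_saved": 150_000,
    "licensing_savings": 120_000,
    "fraud_prevented": 180_000,
}

roi = simple_roi(example_costs, example_gains)  # fraction, not percent
```

With these placeholder figures, costs total 300,000 and gains total 450,000, so the ROI is 0.5 (50%).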
Step-by-step procedure for replacing a legacy data system with a Big Data System
- Big Data Migration Roadmap.
- What critical information is needed before architecting a Big Data system?
- What are the different ways for calculating Volume, Velocity, Variety, and Veracity of data?
- How to estimate data growth.
- Case studies.
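One common way to estimate data growth is to assume a constant compound annual growth rate, so that the volume after n years is V0 × (1 + r)^n. The sketch below applies that assumption; real growth is rarely this smooth, and the function names and figures are hypothetical.

```python
def project_volume(current_tb, annual_growth_rate, years):
    """Project storage volume per year under constant compound growth."""
    return [current_tb * (1 + annual_growth_rate) ** y for y in range(years + 1)]


def estimate_growth_rate(start_tb, end_tb, years):
    """Back out the compound annual growth rate from two measurements."""
    return (end_tb / start_tb) ** (1 / years) - 1


# Hypothetical figures: 100 TB today, growing 50% per year.
projection = project_volume(100, 0.5, 2)       # volumes for years 0, 1, 2
observed_rate = estimate_growth_rate(100, 225, 2)
```

Here the projection is 100, 150, and 225 TB for years 0 to 2, and the rate recovered from the first and last values is 0.5.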
Review of Big Data vendors and their products.
- Accenture.
- APTEAN (Formerly CDC Software).
- Cisco Systems.
- Cloudera.
- Dell.
- EMC.
- GoodData Corporation.
- Guavus.
- Hitachi Data Systems.
- Hortonworks.
- HP.
- IBM.
- Informatica.
- Intel.
- Jaspersoft.
- Microsoft.
- MongoDB (Formerly 10Gen).
- Mu Sigma.
- NetApp.
- Opera Solutions.
- Oracle.
- Pentaho.
- Platfora.
- QlikTech.
- Quantum.
- Rackspace.
- Revolution Analytics.
- Salesforce.
- SAP.
- SAS Institute.
- Sisense.
- Software AG/Terracotta.
- Soft10 Automation.
- Splunk.
- Sqrrl.
- Supermicro.
- Tableau Software.
- Teradata.
- Think Big Analytics.
- Tidemark Systems.
- Treeminer.
- VMware (Part of EMC).
Q/A session.
Requirements
- Knowledge of law enforcement procedures and data systems.
- Basic understanding of SQL/Oracle or relational databases.
- Basic understanding of statistics (at the spreadsheet level).
Target Audience
- Law enforcement specialists with a technical background.