Graph Database (Neo4j), Fraud Detection (Panama Papers) and Python
“making explicit what is implicit”
“Data Science in Cerved” by Stefano Gatti and Nunzio Pellegrino, Cerved
On 25 September 2017, Cerved hosted a series of talks about graph database technology (Neo4j).
Stefano Gatti and Nunzio Pellegrino gave a talk that introduced both Cerved and the properties of the graph model.
Cerved is one of the most important data-driven companies in Italy. It works in three sectors: credit information, marketing solutions, and credit management. It manages Chamber of Commerce data, official data, proprietary data, open data and web data.
Cerved's organization can be seen as three layers. The first is the data company, the base layer of data. The second is the algorithms company, which works on the data and adds value through algorithms. The third is the solutions company, which produces customer solutions as the final step.
Graph technology is valuable not only for data analysis but also for building new products: it links certain types of data and offers new ways of analyzing and visualizing them, for example through network analysis.
A property graph contains connected entities (nodes), each of which can hold any number of attributes (key-value pairs). Nodes can be tagged with labels (person, company) that represent their roles in the domain. Besides giving context to node and relationship properties, labels can also attach metadata, such as index or constraint information, to certain nodes.
Relationships provide directed, named, semantically relevant connections between two nodes. A relationship always has a direction, a type, a start node, and an end node. Like nodes, relationships can have properties; in most cases these are quantitative, such as weights, costs, distances, ratings, time intervals, or strengths. Because relationships are stored efficiently, two nodes can share any number and any type of relationships without sacrificing performance. Although relationships are directed, they can always be navigated in either direction.
Where an RDBMS works with tables, rows and columns, a graph model works with nodes, relationships and properties. Graph databases are part of the NoSQL ecosystem and aim to manage very large, highly connected data sets. The graph model is flexible, can be queried with declarative or imperative languages, and scales horizontally. Its advantages over a relational database are expressiveness, simplicity and additivity: nodes and relationships can be added flexibly. Each node holds direct references to its neighbours, so a query touches only the portion of the graph being explored, since the relationships are stored persistently rather than computed at query time.
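The property graph model described above can be sketched in a few lines of plain Python. This is only an illustration of the concepts (labels, key-value properties, typed and directed relationships that can be walked both ways), not how Neo4j stores data internally. The class names and the sample data (Alice, Acme, WORKS_FOR) are made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: int
    labels: set = field(default_factory=set)         # e.g. {"Person"}
    properties: dict = field(default_factory=dict)   # key-value pairs


@dataclass
class Relationship:
    rel_type: str    # e.g. "WORKS_FOR"
    start: int       # id of the start node
    end: int         # id of the end node
    properties: dict = field(default_factory=dict)


class PropertyGraph:
    """Toy in-memory property graph (illustrative only)."""

    def __init__(self):
        self.nodes = {}
        self.adjacency = {}   # node id -> relationships touching that node

    def add_node(self, node_id, labels=(), **properties):
        self.nodes[node_id] = Node(node_id, set(labels), dict(properties))
        self.adjacency.setdefault(node_id, [])
        return self.nodes[node_id]

    def add_relationship(self, start, rel_type, end, **properties):
        rel = Relationship(rel_type, start, end, dict(properties))
        # Register the relationship on both endpoints so that it can be
        # followed from either side, even though it has a direction.
        self.adjacency[start].append(rel)
        if end != start:
            self.adjacency[end].append(rel)
        return rel

    def neighbours(self, node_id, direction="both"):
        """Return neighbour ids, following 'out', 'in' or 'both' directions."""
        result = []
        for rel in self.adjacency[node_id]:
            if direction in ("out", "both") and rel.start == node_id:
                result.append(rel.end)
            elif direction in ("in", "both") and rel.end == node_id:
                result.append(rel.start)
        return result


# Made-up sample data: one person working for one company.
graph = PropertyGraph()
graph.add_node(1, labels=["Person"], name="Alice")
graph.add_node(2, labels=["Company"], name="Acme")
graph.add_relationship(1, "WORKS_FOR", 2, since=2015)

print(graph.neighbours(1, "out"))   # follows the relationship's direction
print(graph.neighbours(2))          # ignores the direction
```

Each node keeps its own list of relationships, so finding neighbours only touches the part of the graph around that node, which is the idea behind the query-time advantage mentioned above.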
The real advantage, however, lies in exploring the data: graph databases can serve as the back end of web applications. A video of the talk is available, along with a link to try a Neo4j graph model.
“Panama Papers and Next Generation Fraud Detection” by Stefan Kolmar, Neo4j
In the next talk, Stefan Kolmar presented a Neo4j use case: the Panama Papers.
The Panama Papers are 11.5 million leaked documents that detail financial and attorney-client information for more than 214,488 offshore entities. The documents, which belonged to the Panamanian law firm and corporate service provider Mossack Fonseca, were leaked in 2015 by an anonymous source. The ICIJ (International Consortium of Investigative Journalists) published them, exposing how offshore tax havens are used at scale by elites around the world: billionaires, sports stars and politicians all use tax havens to hide their money. The ICIJ is a network of around 200 journalists in more than 65 countries who work together on cross-border investigations into issues of global concern and systemic problems in society. The Panama Papers amount to about 2.6 terabytes of data from 3 million files, so the first question was how to manage so much information. Technology became crucial.
The collection process started from raw files (emails, Excel files…), which were preprocessed into metadata and raw text and then loaded into a database built for search and discovery. Investigators used optical character recognition to make millions of scanned documents text-searchable, and other analytical tools to extract metadata from the documents.
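As a rough illustration of the "raw file to metadata plus raw text" step, the sketch below parses a single email with Python's standard library and builds a tiny inverted index for keyword search. It is a stand-in for the idea only: the tools the investigators actually used are not named here, and the sample message, addresses and function names are invented for the example.

```python
import re
from email import message_from_string
from email.utils import parseaddr

# Invented sample message; real input would be read from the raw files.
RAW_EMAIL = """\
From: Sender Name <sender@example.com>
To: recipient@example.com
Subject: Incorporation documents
Date: Mon, 02 Mar 2015 10:15:00 +0000

Please find the requested documents attached.
"""


def preprocess_email(raw):
    """Split a raw email into a metadata dict and its plain-text body."""
    msg = message_from_string(raw)
    metadata = {
        "sender": parseaddr(msg["From"])[1],
        "recipient": parseaddr(msg["To"])[1],
        "subject": msg["Subject"],
        "date": msg["Date"],
    }
    text = msg.get_payload().strip()
    return metadata, text


def build_index(documents):
    """Map each lowercase token to the set of document ids containing it."""
    index = {}
    for doc_id, text in documents.items():
        for token in re.findall(r"[a-z0-9]+", text.lower()):
            index.setdefault(token, set()).add(doc_id)
    return index


metadata, text = preprocess_email(RAW_EMAIL)
index = build_index({"email-1": text})

print(metadata["sender"])
print(index["documents"])
```

Keeping metadata and text separate is what later allows the metadata (who wrote to whom, when) to become nodes and relationships, while the text stays searchable.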
All the information was then connected using the leaked databases, creating a graph of nodes and edges in Neo4j that was made accessible through Linkurious' visualization application. In a process this involved, context is essential for determining the entities and their relationships, the potential properties of entities and relationships, and the sources of those entities and their properties. A video of the talk is available.
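To make the "entities, relationships, properties, sources" idea concrete, here is a minimal sketch that turns tabular records into parameterized Cypher MERGE statements. The labels (Officer, Entity), the relationship types, the property names and the rows are all assumptions made up for illustration; they are not taken from the talk or from the real data set. Cypher does not accept parameters for relationship types, so the type is validated and interpolated into the query text, while the values travel as parameters.

```python
import re

# Made-up rows standing in for records extracted from source databases.
ROWS = [
    {"officer": "Person A", "rel": "officer_of", "entity": "Company X",
     "source": "sample"},
    {"officer": "Person A", "rel": "shareholder_of", "entity": "Company Y",
     "source": "sample"},
    {"officer": "Person B", "rel": "officer_of", "entity": "Company X",
     "source": "sample"},
]

# Doubled braces are literal braces after str.format().
CYPHER_TEMPLATE = (
    "MERGE (o:Officer {{name: $officer}}) "
    "MERGE (e:Entity {{name: $entity}}) "
    "MERGE (o)-[r:{rel_type}]->(e) "
    "SET r.source = $source"
)


def to_statement(row):
    """Return a (query, parameters) pair for one input row."""
    rel_type = row["rel"].upper()
    # Only letters and underscores are allowed, because the relationship
    # type is inserted into the query text rather than passed as a parameter.
    if not re.fullmatch(r"[A-Z_]+", rel_type):
        raise ValueError(f"unexpected relationship type: {row['rel']!r}")
    query = CYPHER_TEMPLATE.format(rel_type=rel_type)
    params = {
        "officer": row["officer"],
        "entity": row["entity"],
        "source": row["source"],
    }
    return query, params


statements = [to_statement(row) for row in ROWS]
for query, params in statements:
    print(query, params)
```

Using MERGE rather than CREATE means that "Person A" and "Company X" each end up as a single node even though they appear in several rows, which is what lets the connections between records emerge.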
“Let Neo4j chat with Python, it’s easy!” by Fabio Lamanna, Larus BA
In the third talk, Fabio Lamanna from Larus showed how to use Python with Neo4j. One use of Python is cleansing raw data to make it ready for Neo4j; another is in a continuous workflow combined with Neo4j, to run queries and fast analyses. The first use case shown was a project between Università Cattolica di Milano and Copenhagen Polytechnic on a database of scientific papers, with the aim of finding and categorizing publications and journals by topic, unveiling collaboration patterns among researchers, and producing recommendations. The project relies on Natural Language Processing (NLP): Python, with packages such as Pandas, TextBlob and Pattern, works on the raw data and makes it ready to load into Neo4j.
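The cleansing step can be sketched as follows. The talk mentions Pandas, TextBlob and Pattern; this example deliberately uses only the standard library so that it runs anywhere, and the records, column names and cleaning rules are invented for illustration. It normalizes whitespace and casing, drops records without a title, removes duplicates, and writes a CSV that an import step could pick up.

```python
import csv
import io

# Invented raw records with typical problems: stray spaces, inconsistent
# casing, a duplicate and a record with no title.
RAW_ROWS = [
    {"title": "  Graph   Mining ", "journal": "journal of examples",
     "authors": "smith, a.; JONES, B."},
    {"title": "graph mining", "journal": "Journal of Examples",
     "authors": "Smith, A.; Jones, B."},
    {"title": "", "journal": "Journal of Examples", "authors": "Doe, C."},
]


def clean_rows(rows):
    """Normalize fields, skip untitled records and drop duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        title = " ".join(row["title"].split())   # collapse whitespace
        if not title:
            continue                             # no title: skip the record
        key = title.lower()
        if key in seen:
            continue                             # same title already kept
        seen.add(key)
        authors = [a.strip().title()
                   for a in row["authors"].split(";") if a.strip()]
        cleaned.append({
            "title": title,
            "journal": row["journal"].strip().title(),
            "authors": "|".join(authors),        # pipe-separated author list
        })
    return cleaned


def to_csv(rows):
    """Serialize cleaned rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer,
                            fieldnames=["title", "journal", "authors"],
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


cleaned = clean_rows(RAW_ROWS)
print(to_csv(cleaned))
```

A file like this could then be loaded into Neo4j, for example with Cypher's LOAD CSV clause, splitting the pipe-separated authors into separate nodes at import time.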
The last use case concerned airport mobility with Twitter data. The goal is to study passenger mobility, and so the use of space within airport terminals, through the geolocation of passengers' Twitter messages. Data were collected from the 25 busiest airports in Europe over the previous three years. In this example Python is used not only to clean the data but also to interact with Neo4j directly, through the py2neo package, to produce plots and tables. A video of the talk is available.
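A minimal sketch of querying Neo4j from Python with py2neo might look like the following. It assumes a Neo4j instance reachable at bolt://localhost:7687 and a purely hypothetical schema (Tweet and Airport nodes, a POSTED_AT relationship, a terminal property); none of these come from the talk. The py2neo import sits inside the function so that the aggregation part can be tried without a database or the package installed.

```python
from collections import Counter

# Hypothetical schema: (:Tweet {terminal})-[:POSTED_AT]->(:Airport {code}).
QUERY = (
    "MATCH (t:Tweet)-[:POSTED_AT]->(a:Airport {code: $code}) "
    "RETURN t.terminal AS terminal"
)


def fetch_terminals(code, uri="bolt://localhost:7687",
                    user="neo4j", password="password"):
    """Run QUERY against Neo4j and return one terminal name per tweet.

    Requires the py2neo package and a running database; the connection
    details above are placeholders.
    """
    from py2neo import Graph   # imported lazily, see the note above

    graph = Graph(uri, auth=(user, password))
    records = graph.run(QUERY, code=code).data()   # list of dicts
    return [record["terminal"] for record in records]


def tweets_per_terminal(terminals):
    """Count tweets per terminal, most frequent first."""
    return Counter(terminals).most_common()


# Offline demonstration with made-up values instead of a database call.
sample = ["T1", "T2", "T1", "T3", "T1", "T2"]
table = tweets_per_terminal(sample)
for terminal, count in table:
    print(terminal, count)
```

With a database available, `tweets_per_terminal(fetch_terminals("XXX"))` would produce the same kind of table for one airport code, and the result could be handed to a plotting library.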
Written by Claudio G. Giancaterino
Originally published in October 2017