Author: eca

  • 15-20 minutes. Not always. But very often.

    As mentioned in the first articles of this blog, Matrix was born with the idea of monitoring the Internet, not only in the classic sense used in the field of cyber security, but more broadly, with a view to following trajectories and hypothesizing the scenarios they lead to. With this goal in mind, I have been working for years on several fronts, one of the main ones being the detection of new entities, mainly Internet domains.

    With the latest release a few weeks ago I am happy to note an important improvement: the detection time of a newly registered Internet domain has become much shorter. When I say “detection of a newly registered Internet domain” I am not referring to the classic feed that reports thousands of indicators per minute, like CTL or pDNS. I am referring instead to a timely notification that says “this domain was registered 15 minutes ago”.

    For Matrix, the difference between having a stream to dig through in search of something and receiving a precise notification is substantial.

    When I write “15 minutes” I do not write it by chance: 15-20 minutes is in fact the amount of time needed to receive the notification of a new domain on the Elasticsearch console. On the underlying feeds, those that populate the Elasticsearch indexes, this information arrives about 2-3 minutes earlier. It is therefore a matter of registering a domain with a provider and, with some exceptions, seeing it appear on one of the Matrix feeds within a few minutes. Obviously there are circumstances in which this does not happen and you have to wait a few hours.

    Today I ran a test: I registered three domains and all three arrived on the console within 15 to 20 minutes. I do not know if this is a great thing for everyone, but for me it is, and it is also very useful for the feed that I publish on Urlscan, where I often see that the notifications arriving from Matrix are the first to identify some types of indicators, especially phishing sites.

    So that this article can be considered scientific rather than marketing, I report here the “journey” of a domain through the bowels of Matrix, from registration to being marked as a “newly registered domain”.

    The registered domain is “matrix-speed-test.com”.

    The one-hour difference is obviously due to the time zone here in Italy.

    The interesting aspect of this Kibana screenshot is not the line extracted from the “scirocco” feed: that line only reports the presence of the domain within the CTL feed, which is easily accessible, and knowing the name of a domain it is easy to filter it. The interesting line is the one related to the “libeccio” feed: this is the feed that contains the newly registered domains.
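
    To give an idea of what that lookup amounts to, here is a minimal sketch of how the “libeccio” index could be queried for the test domain with the Elasticsearch Python client. The index name comes from this article; the field names, index pattern and connection details are assumptions for illustration only.

    ```python
    # Sketch: find when a domain first appears in the "libeccio" index.
    # Field names ("domain", "@timestamp") and credentials are assumptions.
    from elasticsearch import Elasticsearch

    es = Elasticsearch("https://localhost:9200", basic_auth=("elastic", "changeme"))

    resp = es.search(
        index="libeccio-*",                               # assumed daily-rotated pattern
        query={"term": {"domain": "matrix-speed-test.com"}},
        sort=[{"@timestamp": {"order": "asc"}}],
        size=1,
    )

    for hit in resp["hits"]["hits"]:
        src = hit["_source"]
        print(src.get("@timestamp"), src.get("domain"))
    ```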

  • Lucene has finally arrived, thanks to Elasticsearch

    Over the years I have had the opportunity to fall in love with several software products: C, Linux, Java, .NET, HTTPD, Linux, and others (did I say Linux?), in addition to those mentioned there is also Lucene. Lucene is one of the many fantastic products of the Apache community. It is a library to manage (save, index and search) content. What makes Lucene superior to many other storage systems is its ability to scale.

    A few years ago I had started working on an integration of Lucene within Matrix; unfortunately time was short and I put the project aside, until I abandoned it to work on other aspects of the project. This year, taking advantage of the weeks of August, I decided to take up again an aspect that was absolutely lacking in Matrix: human-machine interaction. Until then the only possible interaction was “grep” and SQL. Obviously this did not make some tasks very comfortable, and in some cases it was necessary to create scripts or programs to extract information and give it a decent form.

    The situation was quite simple: I had a lot of information, stored in different locations, using different technologies. So I re-evaluated the old idea of storing everything in Lucene. The problem was that Lucene is great, but I was missing a lot of components on top of it, above all the graphical interface. So I started looking for products based on Lucene and realized that Elasticsearch was one of them. I had used Elasticsearch in the past in a couple of projects and honestly I knew little about it. So I spent a few days studying and then moved on to practice: I installed a cluster with two nodes and started loading data.

    I immediately started to upload the feeds of Zefiro (NRD), Scirocco (CTL) and Libeccio. Libeccio actually produces two feeds: one contains the Whois queries related to the domains downloaded by Scirocco, the other contains the NRD derived from the processing of the Whois responses.

    After a few weeks I also started uploading the analyses produced by Smith and subsequently the list of expired domains produced by Zefiro.

    Adopting Elasticsearch has allowed me to easily analyze the data produced by Matrix and share this information with colleagues who use it for different use cases.

    There are two main loading methods: Filebeat and the API. Using one of these two methods, each subsystem loads into its own dedicated index. To limit the size of each index (according to the documentation it is advisable to stay under 50GB) some indexes are rotated daily.
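
    As an illustration of the API path, here is a minimal sketch that bulk-indexes a few documents into a daily-rotated index with the Elasticsearch Python client. The daily index naming, document fields and credentials are assumptions, not the actual Matrix schema.

    ```python
    # Sketch: bulk-index documents into a daily-rotated index via the API.
    # Index naming convention and document fields are assumptions.
    from datetime import datetime, timezone

    from elasticsearch import Elasticsearch
    from elasticsearch.helpers import bulk

    es = Elasticsearch("https://localhost:9200", basic_auth=("elastic", "changeme"))

    index = f"libeccio-{datetime.now(timezone.utc):%Y.%m.%d}"  # one index per day

    docs = [
        {"domain": "example-one.com", "source": "whois"},
        {"domain": "example-two.net", "source": "whois"},
    ]

    now = datetime.now(timezone.utc).isoformat()
    actions = ({"_index": index, "_source": {**d, "@timestamp": now}} for d in docs)

    ok, errors = bulk(es, actions)
    print(f"indexed {ok} documents, {len(errors)} errors")
    ```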

    A few weeks ago I added a third node to the cluster and within a few hours the cluster completed the alignment procedure, completely automatically. Awesome.

    Currently, after two months of uploads, the cluster holds about 6B documents.

    Searches continue to run within milliseconds and there have never been any performance issues. So far everything is running fine.

    I tend to be very paranoid about adopting products and very often, after careful analysis, I tend to avoid doing so. Maybe it’s my age, maybe it’s my distrust of the average programmer, but many times I realize that it is quicker for me to develop something myself than to find, understand, test and integrate available products. I understand the principle of not reinventing the wheel, but if attached to the wheel I find a circus with elephants, I prefer to reinvent the wheel. I don’t want to bring up, yet again, the example of “J2EE” application servers used to run some fucking JSP…

    This is not the case: Elasticsearch is really cool!

  • Architecture schema

    In the previous articles we described the project and the components that were created to implement the solution. Below you will find a diagram showing the interactions between the components.

  • From Twitter to Urlscan

    Having decided to try Matrix in the field of cyber security, we found ourselves faced with several choices. One of the main ones was what to do with the evidence found: once we identified a threat, what would we do with it? Working in the sector, the first choice was Netcraft, which would have been an excellent tool to evaluate our results. After a while we realized that Netcraft was good for online threats, or those that could be easily detected anyway; for all other threats, especially those still in the process of being set up, more was needed. We therefore decided to publish all reports on Twitter. Emiliano had a profile that he didn’t use, so we started with that. In a few months the quantity and quality of the reports grew considerably and the “ecarlesi” profile became frequented by analysts and operators in the cyber security sector. This led us to learn about new realities and create new opportunities.

    Below you can find some links that describe the project and its integration with Twitter:

    With the arrival of Musk and the related havoc on Twitter, the “ecarlesi” account was suspended for violating the rules on counterfeiting. The profiles of flat-Earthers and Trump fans were probably more welcome than those of people who collaborate on network security 😀

    At that time we had begun to take an interest in Urlscan, so we decided to move the publication of our information to this platform. This brought a great advantage: some of the tasks we previously had to do to publish a decent report on Twitter were now done better by Urlscan! We only had to communicate the URL and no longer had to worry about taking the screenshot, performing the Whois request, and so on.

    The Urlscan team was very kind and helpful in supporting us in the integration phase.

    Since the tweets from the Twitter profile “ecarlesi” were loaded into Urlscan, we decided to keep the tag “@ecarlesi”. You can then access the feed produced by Matrix and published on Urlscan via this search link:

    https://urlscan.io/search/#task.tags:%22@ecarlesi%22
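
    The same feed can also be pulled programmatically from the public Urlscan search API. A minimal sketch follows; the endpoint and query syntax come from the Urlscan documentation, while the exact result fields read below are assumptions.

    ```python
    # Sketch: fetch the latest Matrix submissions tagged "@ecarlesi" on Urlscan.
    # The fields read from each result ("task.time", "task.url") are assumptions.
    import requests

    resp = requests.get(
        "https://urlscan.io/api/v1/search/",
        params={"q": 'task.tags:"@ecarlesi"', "size": 10},
        timeout=30,
    )
    resp.raise_for_status()

    for item in resp.json().get("results", []):
        task = item.get("task", {})
        print(task.get("time"), task.get("url"))
    ```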

    In less than a year, Matrix sent nearly 2.5 million reports to Urlscan, all publicly available. We think this is a great contribution to the cyber security sector, also achieved thanks to Urlscan.


  • Architecture introduction

    The first consideration we found ourselves making was that the platform needed to manipulate a significant amount of data and that we did not have a sufficient budget to work with standard solutions, mainly due to the cost of the hardware. We had to invent something cheap, reliable and high-performance. Obviously the cost-effectiveness constraint was the most stringent. When you have a budget everything is easier.

    A few key words were decided on immediately and would be the basis of all choices: scalable, asynchronous, lean. We had to avoid components that were too complex to design, write and maintain. Instead, we needed to create an ecosystem of components that communicated with each other to create something more complex. The components had to be able to run on our systems or on external systems and had to be as stupid as possible: little logic on board, with instructions coming from the center.

    Within the platform we identified different types of components: data producers, processors, aggregators, consumers. Each component could be part of one or more categories.
    The development began with the first data producers.

    The first was Zefiro, the component that creates the list of recently registered domains. Zefiro uses zone files to identify new domains (and those that are deleted).
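
    To make the idea concrete, here is a minimal sketch of the zone-file approach: diff the sets of domains seen in two consecutive zone files to obtain the new and the deleted names. The file names and the one-record-per-line format are assumptions; real zone files are far larger and usually compressed.

    ```python
    # Sketch: find new and deleted domains by diffing two consecutive zone files.
    # File names are hypothetical; parsing is deliberately simplified.
    def domains_in_zone(path: str) -> set[str]:
        """Extract the owner names from a zone file, one record per line."""
        names = set()
        with open(path, encoding="utf-8", errors="ignore") as fh:
            for line in fh:
                if line.startswith((";", "$")) or not line.strip():
                    continue  # skip comments, directives and blank lines
                names.add(line.split()[0].rstrip(".").lower())
        return names

    yesterday = domains_in_zone("com.zone.day1")  # hypothetical file names
    today = domains_in_zone("com.zone.day2")

    new_domains = today - yesterday        # newly registered
    deleted_domains = yesterday - today    # dropped or expired

    print(f"{len(new_domains)} new, {len(deleted_domains)} deleted")
    ```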

    After Zefiro came Scirocco, which did similar work but used the Certificate Transparency Log. From here we could extract new domains, in some cases before Zefiro found them; obviously Zefiro had a more global view. Using the two together ensured good coverage. The component that extracts recently registered domains from the CTL feed is called Libeccio. To improve coverage further, Levante was born.

    Levante uses OSINT channels to collect domains and paths which in some cases are not intercepted by Zefiro and Scirocco.
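
    The Libeccio step described above boils down to deciding, for a domain observed on the CTL, whether its registration is recent. A minimal sketch of that check is below, using the third-party python-whois package; the 7-day threshold and the handling of missing data are assumptions.

    ```python
    # Sketch: flag a CTL-observed domain as newly registered if its Whois
    # creation date is recent. Threshold and error handling are assumptions.
    from datetime import datetime, timedelta

    import whois  # pip install python-whois


    def is_newly_registered(domain: str, max_age_days: int = 7) -> bool:
        record = whois.whois(domain)
        created = record.creation_date
        if isinstance(created, list):   # some registries return several dates
            created = min(created)
        if created is None:
            return False                # no creation date available
        return datetime.utcnow() - created < timedelta(days=max_age_days)


    print(is_newly_registered("matrix-speed-test.com"))
    ```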

    The last member of the data producer category was Miniluv. Miniluv processes data from the other three sources and filters and prioritizes it.

    These components handle approximately 20GB per hour. If they were videos it would be a small thing, but they are text strings a few dozen characters long. Lots of stuff.

    In addition to producers we have consumers. The main consumer is Smith. It is an agent that analyzes domains looking for patterns: an investigator. Smith takes a set of analyses to run, runs them and for each analysis generates a report that is returned to the backend. There the report is saved and, if appropriate, sent to the component that takes care of notifications.
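
    As an illustration of this pattern, here is a minimal sketch of an agent that runs a set of analyses against a domain and produces one report per analysis. The two analysis functions are hypothetical placeholders, not Smith’s actual checks.

    ```python
    # Sketch: run a set of analyses against a domain, one report per analysis.
    # The analyses shown here are hypothetical placeholders.
    from dataclasses import dataclass, field
    from datetime import datetime, timezone
    from typing import Callable


    @dataclass
    class Report:
        domain: str
        analysis: str
        findings: dict
        created_at: str = field(
            default_factory=lambda: datetime.now(timezone.utc).isoformat()
        )


    def keyword_analysis(domain: str) -> dict:
        """Hypothetical check: suspicious keywords in the domain name."""
        keywords = ("login", "secure", "verify", "bank")
        return {"matches": [k for k in keywords if k in domain]}


    def length_analysis(domain: str) -> dict:
        """Hypothetical check: unusually long labels."""
        return {"too_long": any(len(label) > 25 for label in domain.split("."))}


    ANALYSES: dict[str, Callable[[str], dict]] = {
        "keywords": keyword_analysis,
        "length": length_analysis,
    }


    def run_analyses(domain: str) -> list[Report]:
        return [Report(domain, name, fn(domain)) for name, fn in ANALYSES.items()]


    for report in run_analyses("secure-login-example.com"):
        print(report)
    ```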

    In short, these are the main components of Matrix; the full list will come in a later post. What we want to share here, however, is the architectural approach. All these components communicate, and to allow them to do so we could have adopted dozens of solutions. At first we implemented everything as microservices, basing communication on REST calls. The approach immediately proved to be inefficient. We then switched to using queues by adopting a broker. It worked better, but the amount of resources consumed by the broker was significant. We therefore decided to create a custom component, and thus Tramontana came to light. It is a server that moves and transforms data. Some concepts were taken from BizTalk, but this had to work and cost little 🙂

    We then converted all the components so that they could be served by Tramontana. The transformation was simple, given that Tramontana monitors storage and moves files, possibly transforming them based on their destination. Each component then had to receive its input from a file and produce its results in files; these files would then be moved and transformed by Tramontana. This choice brings several benefits (a minimal sketch of the idea follows the list):

    • The components were stripped of business logic related to other components and to transport.
    • Traceability was centralized in one component: we can follow the flow of data from a single console.
    • The components do their specific work without caring who will use their data.
    • Tramontana reaches out to the components, so no component needs to be published or exposed.
    • Everything is asynchronous. If a component stops, including Tramontana itself, everything continues to work, even if only partially since some data does not arrive; nothing stops, and when the component is restarted everything returns to 100% operation.
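
    Here is the sketch mentioned above: a tiny file mover that watches a producer’s output directory, transforms each file for its destination and drops it into the consumer’s input directory. Paths, the transformation and the polling loop are assumptions for illustration; the real Tramontana is far richer.

    ```python
    # Sketch: watch an input directory, transform each file for its destination
    # and move it to the consumer's directory. Paths and format are hypothetical.
    import json
    import time
    from pathlib import Path

    INBOX = Path("/data/scirocco/out")   # producer drops files here (hypothetical)
    OUTBOX = Path("/data/libeccio/in")   # consumer reads files from here (hypothetical)


    def transform(src: Path, dst: Path) -> None:
        """Example transformation: plain list of domains -> JSON lines."""
        with src.open() as fin, dst.open("w") as fout:
            for line in fin:
                domain = line.strip()
                if domain:
                    fout.write(json.dumps({"domain": domain}) + "\n")


    def run_once() -> None:
        OUTBOX.mkdir(parents=True, exist_ok=True)
        for src in INBOX.glob("*.txt"):
            dst = OUTBOX / (src.stem + ".jsonl")
            transform(src, dst)
            src.unlink()  # the file has been delivered


    if __name__ == "__main__":
        while True:
            run_once()
            time.sleep(10)  # simple polling loop, for illustration only
    ```
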
  • Origins of the project

    The idea of Matrix was born in the cyber security field, and for this reason the first incarnation of the project operates in that field, where the need to intervene quickly is fundamental.

    The initial problem posed to me by a customer was simple: How do I find a threat before my customers find it? How do I prevent my clients from becoming victims?

    The need was clear: find threats before they could spread.

    Thus began the design of a platform that would allow the identification of new threats or potential new threats that were not yet online.

    The first phase of the project was promising, in the opinion of those working on it. Unfortunately, a lack of resources limited the platform’s analysis capacity, so we tried to look for sponsors. We started contacting larger companies to see if they might be interested in helping us grow; unfortunately and fortunately, no one showed interest. Unfortunately, because the project took longer to reach a sufficient level of maturity. Fortunately, because that lack of resources led us to replace some very onerous processes with much more efficient ones.

    Working on the platform and the data it produced, we realized that the incoming data allowed us to identify threats, but that these were obviously a small percentage of the total. The rest was data relating to the real world: the one that produces, the one that has fun, the one we experience when we are not dealing with security. A new world had opened up, that of newly born or newly created realities, and these realities could be seen before anyone else saw them. We thus began to monitor some areas, creating a few case studies; the main one related to the world of mountain bikes. By activating some indicators we received periodic notifications (several times a day) that allowed us to intercept new activities, groups of cyclists and companies in the sector, before they became public or well known.

    The success of this test led us to work even more intensely on the project, and we decided to use the world of cyber security as a showcase for the platform.

    In the next posts we will talk more about how the Matrix works and its evolutions.