Generic searchable database of leaked databases
#1
Hi guys,
Love the site and hope this community takes off.
I am a programmer and database expert and have been working on something that may interest some of you. I am creating an open ElasticSearch cluster that would be an ever-growing accumulation of all known database leaks, fully searchable. It could then be integrated with a UI to display statistics or expose faceted filtering to users, offering a powerful search experience.

So, do people think this is a good idea? So far I have just a few leaks in there, and the fields I have stored are username, email, password hash, and salt. I would be very interested in hearing what other fields people would want, or any other feedback/suggestions on the idea and how it should be implemented.
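For a rough idea of the shape of the index, here is a minimal sketch using the official elasticsearch-py client (the index name "leaks", the exact field names, and the keyword types are just my working assumptions, not a finished design):

    from elasticsearch import Elasticsearch

    es = Elasticsearch(["http://localhost:9200"])

    # Create the index with the fields mentioned above. Keyword fields can be
    # filtered and aggregated on directly, without enabling fielddata.
    # (On older ES versions the mapping also needs a document-type level.)
    es.indices.create(index="leaks", body={
        "mappings": {
            "properties": {
                "username":      {"type": "keyword"},
                "email":         {"type": "keyword"},
                "password_hash": {"type": "keyword"},
                "salt":          {"type": "keyword"},
            }
        }
    })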
Reply
#2
ElasticSearch is really good; I have used it in a personal webapp I made. I also use it to search through dumps and I must say, both the import and query speeds are impressive.

Let me know if you need any help; I have quite a bit of experience with ES.
Reply
#3
(02-19-2017, 06:32 PM)LSDoom Wrote:  ElasticSearch is really good; I have used it in a personal webapp I made. I also use it to search through dumps and I must say, both the import and query speeds are impressive.

Let me know if you need any help; I have quite a bit of experience with ES.

Hey, yep, I had noticed the query speeds are very impressive for the roughly 200k items I currently have in there, and the aggregations are very powerful. The most time-consuming part for me is running the python script that pulls out the relevant data and PUTs each item into ES one at a time; importing that 200k took about a full day. I have two thoughts I'd be interested in hearing your opinion on:

They say ES scales incredibly well, but I wonder whether that holds within reason. Could I put millions more items in there without expecting much of a performance drop?

One other thing I noticed was that I wasn't able to do aggregations on some fields because their mappings had not set fielddata to true. Fixing this would seem to involve deleting all the items, remapping the index, and then running the script all over again - or is there a quicker solution?
Reply
#4
My script imports millions of entries in just minutes. I have only tested query speeds with about 100 million entries in total (that's all I had the storage space for currently). I didn't notice any drop in performance going up to that point, and I don't imagine seeing one either. Query speeds are super fast, less than 0.2 seconds.

And I do not believe there is a way to remap and add analyzers to an index without dropping all the items in it. However, this is not really a problem, seeing as the import script is so fast.

The only downside to ES is that going from a plaintext/csv file to an index takes a lot of storage - maybe 4-6 times as much as just storing the file as a regular .txt or .csv. This has to do with the way it indexes and is the price to pay for very fast query speeds.
Reply
#5
I guess I need to revise my python scripts big time, as the way I am pulling out the relevant data is clearly very inefficient!

Once I can improve this import speed I will go ahead and import a lot more databases, since the scaling is clearly fantastic. When you mention the storage concerns, how many GB are we talking for those 100 million entries? Just curious about which pay tier I would need when I eventually get a VPS.
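As a rough sketch of what I think the revised script should look like, using the elasticsearch-py bulk helper instead of one PUT per item (the colon-separated dump layout, index name, and field names are just placeholders):

    from elasticsearch import Elasticsearch
    from elasticsearch.helpers import bulk

    es = Elasticsearch(["http://localhost:9200"])

    def generate_actions(path):
        # Yield one indexing action per line instead of issuing a separate
        # PUT request for every document.
        with open(path, encoding="utf-8", errors="replace") as f:
            for line in f:
                parts = line.rstrip("\n").split(":")
                if len(parts) < 4:
                    continue  # skip malformed lines
                username, email, password_hash, salt = parts[:4]
                yield {
                    "_index": "leaks",
                    "_source": {
                        "username": username,
                        "email": email,
                        "password_hash": password_hash,
                        "salt": salt,
                    },
                }

    # helpers.bulk batches the actions into large requests, which should be
    # far faster than indexing one document at a time.
    success, errors = bulk(es, generate_actions("dump.txt"), chunk_size=5000,
                           raise_on_error=False)
    print(success, "documents indexed")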
Reply
#6
(02-19-2017, 07:05 PM)Entropy Wrote:  I guess I need to revise my python scripts big time, as the way I am pulling out the relevant data is clearly very inefficient!

Once I can improve this import speed I will go ahead and import a lot more databases, since the scaling is clearly fantastic. When you mention the storage concerns, how many GB are we talking for those 100 million entries? Just curious about which pay tier I would need when I eventually get a VPS.

I suggest looking into Logstash instead of python; it's made by the same people that made ES and it works great for importing. I can send you my script and you can slightly modify it to fit your needs.

I recently removed stuff from my cluster, so I can't remember exactly how much storage it was using. However, multiply the uncompressed file size by 5 and you should get a pretty accurate number.

Also, the type of storage makes a huge difference in import and query speeds. An SSD is going to be a lot faster than a regular HDD, and when I say "a lot faster", I mean A LOT FASTER. So I suggest getting a large SSD VPS, or a VPS running at least 3 disks in RAID 5.
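For reference, the importing I do is driven by a Logstash pipeline config roughly like the one below (a trimmed-down sketch - the file path, column names, and index name are placeholders you would adjust per dump):

    input {
      file {
        path => "/data/dumps/example.csv"
        start_position => "beginning"
        sincedb_path => "/dev/null"   # re-read the file from the start every run
      }
    }

    filter {
      csv {
        separator => ","
        columns => ["username", "email", "password_hash", "salt"]
      }
    }

    output {
      elasticsearch {
        hosts => ["localhost:9200"]
        index => "leaks"
      }
    }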
Reply
#7
Will take a look at Logstash; I had heard it mentioned a lot but did not realize it handled this kind of data parsing. Thanks for the information and advice!
Reply
#8
A lookup would require:

- Username
- Email
- Password
- Salt
- IP
Reply
#9
(02-19-2017, 11:23 PM)Daemon Wrote:  A lookup would require:

- Username
- Email
- Password
- Salt
- IP

This can easily be done. The good thing about ElasticSearch and Logstash is that they are very flexible. You can easily add or remove any fields you want by changing one line in the import script.
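For example, with a csv filter like the one sketched earlier, adding IP would just mean extending the columns line (assuming the dump actually contains an IP column):

    columns => ["username", "email", "password_hash", "salt", "ip"]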
Reply
#10
(02-20-2017, 03:13 AM)LSDoom Wrote:  
(02-19-2017, 11:23 PM)Daemon Wrote:  A lookup would require:

- Username
- Email
- Password
- Salt
- IP

This can easily be done. The good thing about ElasticSearch and Logstash is that they are very flexible. You can easily add or remove any fields you want by changing one line in the import script.

Would the database team like a private subforum here, or a Discord chat instead? I know Entropy hasn't joined the Discord yet.
Reply
 

