TMK Clusterize on thousands of .tmk files #211

Open
jlekas opened this issue Oct 17, 2019 · 8 comments

@jlekas

jlekas commented Oct 17, 2019

I was attempting to use the tmk-clusterize tool on a set of around 50,000 videos and the program crashed before it was finished running. Is there a limit to the number of hash files that the clusterize tool can operate on?

@johnkerl
Contributor

We need to incorporate FAISS into the clusterizer.

That said, the program might have run out of RAM as it's a single-threaded all-in-one demo -- do you have more context on why it crashed, and/or what you saw on the terminal around the time of the crash?
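For a rough sense of scale behind the RAM hypothesis, here is a back-of-the-envelope sketch. It assumes the tool holds every hash in memory and compares every pair, which is an assumption about the demo rather than something confirmed in this thread. The function name `all_pairs_estimate` and the 256 KiB per-file size are illustrative placeholders, not measured values.

```python
import math


def all_pairs_estimate(num_files, bytes_per_file):
    """Return (pairwise comparisons, GiB needed to hold every file in memory)."""
    pairs = math.comb(num_files, 2)  # unordered pairs: n * (n - 1) / 2
    gib = num_files * bytes_per_file / 2**30
    return pairs, gib


# 50,000 is the file count from the report above; 256 KiB per .tmk file is a
# placeholder assumption, so substitute the real size of your files.
pairs, gib = all_pairs_estimate(50_000, 256 * 1024)
print(f"{pairs:,} pairs, about {gib:.1f} GiB of hashes")
# prints: 1,249,975,000 pairs, about 12.2 GiB of hashes
```

Under those assumptions, both the comparison count and the memory footprint grow quickly enough that a multi-day run ending in an out-of-memory kill would be plausible.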

@jlekas
Author

jlekas commented Oct 21, 2019

The program was left running for 1-2 days, and at some point I switched to the open terminal and found that it had stopped running, although I unfortunately do not remember the output. Do you think it would be worthwhile to fork the project and write a multithreaded or multiprocess version of the file to help it run on larger data sets, or would it be better to wait for FAISS to be incorporated into the clusterizer?

@johnkerl
Contributor

johnkerl commented Oct 22, 2019

@jlekas both. :)

I really apologize for the delay. My team is actively working on unblocking a couple key flows in ThreatExchange and I've not been able to dedicate time to unblocking TMK+FAISS ...

@jlekas
Author

jlekas commented Oct 22, 2019

Oh, no worries. Thank you so much for your timely responses and help in solving this problem. I will look into writing a multithreaded and multiprocess version of the file to see if it works well for my larger data set.
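One possible shape for such a version is sketched below in Python. It is not the tmk-clusterize algorithm: cosine similarity on plain lists stands in for the real TMK scoring, and the names `cosine`, `matches_for_rows`, and `clusterize`, along with the `threshold` and `workers` parameters, are all hypothetical. The idea is to split the rows of the pairwise comparison across workers and then merge the matching pairs into clusters.

```python
import math
from concurrent.futures import ThreadPoolExecutor
from itertools import repeat


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def matches_for_rows(rows, vectors, threshold):
    """Compare each index i in `rows` against every j > i; return matching (i, j) pairs."""
    found = []
    for i in rows:
        for j in range(i + 1, len(vectors)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                found.append((i, j))
    return found


def clusterize(vectors, threshold, workers=4, executor_cls=ThreadPoolExecutor):
    """Group vector indices whose pairwise similarity reaches `threshold`."""
    n = len(vectors)
    # Interleave rows so each worker gets a mix of long and short rows.
    chunks = [list(range(w, n, workers)) for w in range(workers)]
    with executor_cls(max_workers=workers) as pool:
        results = pool.map(matches_for_rows, chunks, repeat(vectors), repeat(threshold))
        pairs = [pair for chunk in results for pair in chunk]

    # Merge matching pairs into clusters with a small union-find.
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in pairs:
        parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Toy stand-ins for hash vectors: two near-duplicate pairs and one outlier.
toy = [[1, 0], [0.99, 0.1], [0, 1], [0.1, 0.99], [-1, 0]]
print(clusterize(toy, threshold=0.9))  # prints [[0, 1], [2, 3], [4]]
```

In CPython, threads will not speed up CPU-bound scoring, so for real workloads `executor_cls=ProcessPoolExecutor` (called under an `if __name__ == "__main__":` guard) is the likelier choice. Note that this only spreads the comparisons across cores; every worker still needs all the vectors, so it does not reduce memory use.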

@github-actions
Contributor

This issue is being marked as stale because it has no recent activity. It will be closed automatically in 14 days
unless it becomes active before then. To prevent closing, please comment on the issue before that time. If the
issue is no longer relevant, please feel free to close it prior to that time.

Cleaning up stale issues helps redirect focus to the issues top of mind of the community. Thank you for your help
with this.

@github-actions github-actions bot added the Stale label Nov 13, 2020
@github-actions
Contributor

This issue has been closed due to no recent activity. If you need this issue reopened, please let us know.
Thanks!

@Dcallies
Contributor

@dxdc - How are things going after your changes - is it performant enough to consider this task closed out?

@dxdc
Contributor

dxdc commented Apr 12, 2022

@Dcallies I've tested with up to 25k .tmk files and it works great. There may be better ways to optimize the parallelization, but I would imagine those would be smaller-scale improvements at this point.
