QuickStart¶
Build NLP Search in 5 Minutes¶
Log into the client as belows.
[88]:
from vectorai.client import ViClient
vi_client = ViClient(username, api_key, url)
collection_name = 'nlp_quickstart'
vi_client.delete_collection(collection_name)
Then use our Text2Vec model to convert the text to vectors!
[6]:
from vectorai.models.deployed import ViText2Vec
text_encoder = ViText2Vec(username, api_key, url)
Then insert the data as shown below!
[26]:
# %pip install datasets - we use the HuggingFace dataset.
from datasets import load_dataset
dataset = load_dataset('aeslc', split='test')
[27]:
# Add '_id' to the document ID
documents = [{'_id': str(n), **doc} for n, doc in enumerate(dataset)]
[31]:
documents[0:2]
[31]:
[{'_id': '0',
'email_body': "Phillip, Could you please do me a favor?\nI would like to read your current title policy to see what it says about easements.\nYou should have received a copy during your closing.\nI don't know how many pages it will be but let me know how you want to handle getting a copy made.\nI'll be happy to make the copy, or whatever makes it easy for you.\nThanks,\n",
'subject_line': 'Huntley/question\n'},
{'_id': '1',
'email_body': 'The following reports have been waiting for your approval for more than 4 days.\nPlease review.\nOwner: James W Reitmeyer Report Name: JReitmeyer 10/24/01 Days In Mgr.\nQueue: 5\n',
'subject_line': 'Expense Reports Awaiting Your Approval\n'}]
[64]:
vi_client.insert_documents(collection_name, documents[0:200], models={'email_body':text_encoder},
verbose=False, use_bulk_encode=True, chunksize=3)
[64]:
{'inserted_successfully': 200, 'failed': 0, 'failed_document_ids': []}
As we are inserting the data, note that the vector field names automatically adapt to our schema. In other words, the field “Email Body” ends up becoming “email_body_vector_”.
Before we search, let us take a quick look at all our data to make sure it’s been properly inserted. And now, our data is ready to search!
[74]:
vi_client.collection_stats(collection_name)
[74]:
{'size_mb': 3.653697,
'number_of_documents': 205,
'number_of_searches': 0,
'number_of_id_lookups': 0}
Let’s search through our e-mails for messages about viruses.
[87]:
result = vi_client.search(collection_name, text_encoder.encode('Emails about viruses.'),
field='email_body_vector_', page_size=5)
vi_client.results_pretty(result, 'email_body')
[87]:
email_body | |
---|---|
0 | A virus has been detected in the Enron email environment.\nThe offending email has the following characteristics : The SUBJECT is "Hi" It contains the following text : How are you ?\nWhen I saw this screen saver, I immediately thought about you I am in a harry, I promise you will love it!\nIt contains an attachment called "gone.scr" (41 KB) Please DO NOT launch the attachment (as this will launch the virus).\nDelete ALL occurances of this email from your mail box immediately.\nPlease note that opening/previewing the email does not launch the virus.\nFurthermore, it is absolutely against company policy to open any external (non-Enron) mail account from within the Enron environment.\nThis means that you should not open a browser from an Enron PC and go to any sites that offer Email services.\nWhen you do this, you are tracked and IT is aware of your actions.\nIn today's case, the virus that was launched caused a the entire global Email system to be shut down for about an hour and a half.\nIt is also against company policy to launch any attachments without saving it to your desktop first so that desktop virus scanning can take place.\nIf you did launch this virus, please call your Resolution Center/Help Desk and have your computer scanned.\nThank you for your assistance.\n |
1 | You have received this message because someone has attempted to send you an e-mail from outside of Enron with an attachment type that Enron does not allow into our messaging environment.\nYour e-mail has been quarantined and is being held at the MailSweeper server.\nSender: daphneco64@alltel.net\n |
2 | Your mailbox has exceeded one or more size limits set by your administrator.\nYour mailbox size is 113195 KB.\nMailbox size limits: \tYou will receive a warning when your mailbox reaches 75000 KB.\nYou cannot send mail when your mailbox reaches 100000 KB.You may not be able to send or receive new mail until you reduce your mailbox size.\nTo make more space available, delete any items that you are no longer using or move them to your personal folder file (.pst).\nItems in all of your mailbox folders including the Deleted Items and Sent Items folders count against your size limit.\nYou must empty the Deleted Items folder after deleting items or the space will not be freed.\nSee client Help for more information.\n |
3 | Your mailbox has exceeded one or more size limits set by your administrator.\nYour mailbox size is 112312 KB.\nMailbox size limits: \tYou will receive a warning when your mailbox reaches 75000 KB.\nYou cannot send mail when your mailbox reaches 100000 KB.You may not be able to send or receive new mail until you reduce your mailbox size.\nTo make more space available, delete any items that you are no longer using or move them to your personal folder file (.pst).\nItems in all of your mailbox folders including the Deleted Items and Sent Items folders count against your size limit.\nYou must empty the Deleted Items folder after deleting items or the space will not be freed.\nSee client Help for more information.\n |
4 | Your mailbox has exceeded one or more size limits set by your administrator.\nYour mailbox size is 112231 KB.\nMailbox size limits: \tYou will receive a warning when your mailbox reaches 75000 KB.\nYou cannot send mail when your mailbox reaches 100000 KB.You may not be able to send or receive new mail until you reduce your mailbox size.\nTo make more space available, delete any items that you are no longer using or move them to your personal folder file (.pst).\nItems in all of your mailbox folders including the Deleted Items and Sent Items folders count against your size limit.\nYou must empty the Deleted Items folder after deleting items or the space will not be freed.\nSee client Help for more information.\n |
Build Image Search In 5 Minutes¶
[6]:
collection_name = 'pokemon_images'
documents = []
for i in range(1, 20):
documents.append({
'image': 'https://assets.pokemon.com/assets/cms2/img/pokedex/full/{}.png'.format(f'{i:03}'),
'pokemon_id' : str(i),
'_id': i
})
[7]:
#1. specify the vdb client
from vectorai.client import ViClient
vi_client = ViClient(username, api_key, url)
vi_client.delete_collection(collection_name)
#2. specify an image encoder
from vectorai.models.deployed import ViImage2Vec
image_encoder = ViImage2Vec(username, api_key, url)
Logged in. Welcome public-demo. To view list of available collections, call list_collections() method.
[7]:
{'status': 'complete', 'message': 'pokemon_images deleted'}
[8]:
#3. insert the documents and encode images simultaneously
# using jobs means that the encoding process takes place on our servers as opposed to your computer
use_jobs = False
if use_jobs:
vi_client.insert_documents(collection_name, documents)
job = vi_client.encode_image_job(collection_name, 'image')
vi_client.wait_till_jobs_complete(collection_name, job['job_id'], job['job_name'])
else:
vi_client.insert_documents(collection_name, documents, models={'image':image_encoder.encode})
[8]:
{'inserted_successfully': 19, 'failed': 0, 'failed_document_ids': []}
[9]:
#4. search
search_results = vi_client.search(collection_name,
image_encoder.encode('https://assets.pokemon.com/assets/cms2/img/pokedex/full/003.png'),
'image_vector_', page_size=5)
#4.2 first result is the query audio itself
vi_client.show_json(search_results, image_fields=['image'], image_width=150)
[9]:
_id | image | pokemon_id | insert_date_ | _search_score | |
---|---|---|---|---|---|
0 | 3 | ![]() |
3 | 2020-10-02T07:07:18.799559 | 1.000000 |
1 | 2 | ![]() |
2 | 2020-10-02T07:07:18.797693 | 0.920337 |
2 | 1 | ![]() |
1 | 2020-10-02T07:07:18.795655 | 0.838996 |
3 | 17 | ![]() |
17 | 2020-10-02T07:07:30.587841 | 0.835111 |
4 | 16 | ![]() |
16 | 2020-10-02T07:07:30.585399 | 0.813012 |
[10]:
#5 recommendation by id
search_by_id_results = vi_client.search_by_id(collection_name, '2', 'image_vector_', page_size=5)
#5.2 first result is the id's audio itself
vi_client.show_json(search_by_id_results, image_fields=['image'], image_width=150)
[10]:
_id | image | pokemon_id | insert_date_ | _search_score | |
---|---|---|---|---|---|
0 | 2 | ![]() |
2 | 2020-10-02T07:07:18.797693 | 1.000000 |
1 | 3 | ![]() |
3 | 2020-10-02T07:07:18.799559 | 0.920337 |
2 | 1 | ![]() |
1 | 2020-10-02T07:07:18.795655 | 0.895991 |
3 | 7 | ![]() |
7 | 2020-10-02T07:07:18.807199 | 0.839945 |
4 | 17 | ![]() |
17 | 2020-10-02T07:07:30.587841 | 0.833277 |
Build Audio Search in 5 Minutes¶
Building Audio search is easy with Vi!
[11]:
collection_name = 'audio_quickstart'
#create the documents
documents = []
for i in range(1, 1001):
documents.append({
'audio': 'https://vecsearch-bucket.s3.us-east-2.amazonaws.com/voices/common_voice_en_{}.wav'.format(i),
'name' : 'common_voice_en_{}.wav'.format(i),
'_id': i
})
[12]:
#1. specify the vdb client
from vectorai.client import ViClient
vi_client = ViClient(username, api_key, url)
vi_client.delete_collection(collection_name)
#2. specify an audio encoder
from vectorai.models.deployed import ViAudio2Vec
audio_encoder = ViAudio2Vec(username, api_key, url)
Logged in. Welcome public-demo. To view list of available collections, call list_collections() method.
[12]:
{'status': 'complete', 'message': 'audio_quickstart deleted'}
[13]:
#3. insert the documents and encode audio simultaneously
use_jobs = True
if use_jobs:
vi_client.insert_documents(collection_name, documents)
job = vi_client.encode_audio_job(collection_name, 'audio')
vi_client.wait_till_jobs_complete(collection_name, job['job_id'], job['job_name'])
else:
vi_client.insert_documents(collection_name, documents, models={'audio':audio_encoder.encode})
d:\kda\vectorai\vectorai\read.py:351: UserWarning: Potential issue. Cannot find a vector field. Check that the vector field is _vector_.
"Potential issue. Cannot find a vector field. Check that the vector field is _vector_."
[13]:
{'inserted_successfully': 1000, 'failed': 0, 'failed_document_ids': []}
{'status': 'Finished'}
[13]:
'Done'
[14]:
import IPython.display as ipd
#4. search
search_results = vi_client.search(collection_name, audio_encoder.encode(documents[0]['audio']),
'audio_vector_', page_size=5)
vi_client.show_json(search_results, audio_fields=['audio'])
[14]:
_id | name | insert_date_ | audio | _search_score | |
---|---|---|---|---|---|
0 | 1 | common_voice_en_1.wav | 2020-10-02T07:07:34.725378 | 1.000000 | |
1 | 12 | common_voice_en_12.wav | 2020-10-02T07:07:34.726100 | 0.893219 | |
2 | 32 | common_voice_en_32.wav | 2020-10-02T07:07:35.271638 | 0.891373 | |
3 | 20 | common_voice_en_20.wav | 2020-10-02T07:07:35.046128 | 0.882336 | |
4 | 15 | common_voice_en_15.wav | 2020-10-02T07:07:34.726251 | 0.877323 |
[15]:
#5 recommendation by id
search_by_id_results = vi_client.search_by_id(collection_name, '2', 'audio_vector_', page_size=5)
vi_client.show_json(search_by_id_results, audio_fields=['audio'])
[15]:
_id | name | insert_date_ | audio | _search_score | |
---|---|---|---|---|---|
0 | 2 | common_voice_en_2.wav | 2020-10-02T07:07:34.725536 | 1.000000 | |
1 | 40 | common_voice_en_40.wav | 2020-10-02T07:07:35.272603 | 0.884632 | |
2 | 3 | common_voice_en_3.wav | 2020-10-02T07:07:34.725629 | 0.879187 | |
3 | 14 | common_voice_en_14.wav | 2020-10-02T07:07:34.726200 | 0.874556 | |
4 | 21 | common_voice_en_21.wav | 2020-10-02T07:07:35.046224 | 0.865409 |
Build Text QA Search in 5 minutes¶
[16]:
%pip install datasets
Requirement already satisfied: datasets in c:\users\jacky\anaconda3\lib\site-packages (1.0.1)
Requirement already satisfied: filelock in c:\users\jacky\anaconda3\lib\site-packages (from datasets) (3.0.12)
Requirement already satisfied: dill in c:\users\jacky\anaconda3\lib\site-packages (from datasets) (0.3.1.1)
Requirement already satisfied: pyarrow>=0.17.1 in c:\users\jacky\anaconda3\lib\site-packages (from datasets) (1.0.0)
Requirement already satisfied: requests>=2.19.0 in c:\users\jacky\anaconda3\lib\site-packages (from datasets) (2.22.0)
Requirement already satisfied: numpy>=1.17 in c:\users\jacky\anaconda3\lib\site-packages (from datasets) (1.19.1)
Requirement already satisfied: pandas in c:\users\jacky\anaconda3\lib\site-packages (from datasets) (0.25.2)
Requirement already satisfied: tqdm>=4.27 in c:\users\jacky\anaconda3\lib\site-packages (from datasets) (4.36.1)
Requirement already satisfied: xxhash in c:\users\jacky\anaconda3\lib\site-packages (from datasets) (2.0.0)
Requirement already satisfied: certifi>=2017.4.17 in c:\users\jacky\anaconda3\lib\site-packages (from requests>=2.19.0->datasets) (2020.6.20)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in c:\users\jacky\anaconda3\lib\site-packages (from requests>=2.19.0->datasets) (1.24.2)
Requirement already satisfied: idna<2.9,>=2.5 in c:\users\jacky\anaconda3\lib\site-packages (from requests>=2.19.0->datasets) (2.8)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in c:\users\jacky\anaconda3\lib\site-packages (from requests>=2.19.0->datasets) (3.0.4)
Requirement already satisfied: pytz>=2017.2 in c:\users\jacky\anaconda3\lib\site-packages (from pandas->datasets) (2019.3)
Requirement already satisfied: python-dateutil>=2.6.1 in c:\users\jacky\anaconda3\lib\site-packages (from pandas->datasets) (2.8.0)
Requirement already satisfied: six>=1.5 in c:\users\jacky\anaconda3\lib\site-packages (from python-dateutil>=2.6.1->pandas->datasets) (1.12.0)
Note: you may need to restart the kernel to use updated packages.
[ ]:
collection_name = 'squad'
#use huggingface's datasets library to download squad
import datasets
squad_dataset = datasets.load_dataset('squad')
documents = [{'_id':str(n), **d} for n, d in enumerate(squad_dataset['validation'])]
vi_client.delete_collection(collection_name)
[1]:
#1. specify the vdb client
from vectorai.client import ViClient
vi_client = ViClient(username, api_key, url)
vi_client.delete_collection(collection_name)
#2. specify a text encoder
from vectorai.models.deployed import ViText2Vec
text_encoder = ViText2Vec(username, api_key, 'https://api.vctr.ai')
[20]:
#3. insert the documents and encode text simultaneously
use_jobs = True
if use_jobs:
vi_client.insert_documents(collection_name, documents)
job = vi_client.encode_text_job(collection_name, 'question')
vi_client.wait_till_jobs_complete(collection_name, job['job_id'], job['job_name'])
else:
vi_client.insert_documents(collection_name, documents, models={'question':text_encoder}, use_bulk_encode=True)
d:\kda\vectorai\vectorai\read.py:351: UserWarning: Potential issue. Cannot find a vector field. Check that the vector field is _vector_.
"Potential issue. Cannot find a vector field. Check that the vector field is _vector_."
[20]:
{'inserted_successfully': 10570, 'failed': 0, 'failed_document_ids': []}
{'status': 'Finished'}
[20]:
'Done'
[21]:
#4. search
search_results = vi_client.search(collection_name,
text_encoder.encode('who was the winner for nfl fifty'),
'question_vector_', page_size=5)
#4.2 first result is the query text itself
vi_client.results_to_df(search_results)
[21]:
_id | question | answers | context | insert_date_ | id | title | _search_score | |
---|---|---|---|---|---|---|---|---|
0 | 11 | Who won Super Bowl 50? | {'answer_start': [177, 177, 177], 'text': ['De... | Super Bowl 50 was an American football game to... | 2020-10-02T07:47:06.947313 | 56beace93aeaaa14008c91df | Super_Bowl_50 | 0.798744 |
1 | 24 | Who won Super Bowl 50? | {'answer_start': [177, 177, 177], 'text': ['De... | Super Bowl 50 was an American football game to... | 2020-10-02T07:47:07.285182 | 56d20362e7d4791d009025eb | Super_Bowl_50 | 0.798744 |
2 | 3 | Which NFL team won Super Bowl 50? | {'answer_start': [177, 177, 177], 'text': ['De... | Super Bowl 50 was an American football game to... | 2020-10-02T07:47:06.946694 | 56be4db0acb8001400a502ef | Super_Bowl_50 | 0.763209 |
3 | 55 | Who was the Super Bowl 50 MVP? | {'answer_start': [248, 248, 252], 'text': ['Vo... | The Broncos took an early lead in Super Bowl 5... | 2020-10-02T07:47:07.759154 | 56be4eafacb8001400a50302 | Super_Bowl_50 | 0.754090 |
4 | 26 | Which team won Super Bowl 50. | {'answer_start': [177, 177, 177], 'text': ['De... | Super Bowl 50 was an American football game to... | 2020-10-02T07:47:07.285403 | 56d600e31c85041400946eb0 | Super_Bowl_50 | 0.742759 |
[22]:
#5 recommendation by id
search_by_id_results = vi_client.search_by_id(collection_name, documents[50]['_id'], 'question_vector_', page_size=5)
#5.2 first result is the id's text itself
vi_client.results_to_df(search_by_id_results)
[22]:
_id | question | answers | context | insert_date_ | id | title | _search_score | |
---|---|---|---|---|---|---|---|---|
0 | 50 | Who did Denver beat in the 2015 AFC Championsh... | {'answer_start': [372, 368, 372], 'text': ['Ne... | The Panthers finished the regular season with ... | 2020-10-02T07:47:07.758767 | 56d6017d1c85041400946ec1 | Super_Bowl_50 | 1.000000 |
1 | 48 | Who did Denver beat in the AFC championship? | {'answer_start': [372, 368, 372], 'text': ['Ne... | The Panthers finished the regular season with ... | 2020-10-02T07:47:07.758541 | 56d2045de7d4791d009025f6 | Super_Bowl_50 | 0.960072 |
2 | 331 | Who did the Broncos beat to win their division... | {'answer_start': [25, 25, 36], 'text': ['Pitts... | The Broncos defeated the Pittsburgh Steelers i... | 2020-10-02T07:47:12.209038 | 56d99f99dc89441400fdb628 | Super_Bowl_50 | 0.923735 |
3 | 330 | Who did the Broncos defeat in the AFC Champion... | {'answer_start': [192, 192, 204], 'text': ['Ne... | The Broncos defeated the Pittsburgh Steelers i... | 2020-10-02T07:47:12.208876 | 56d7018a0d65d214001982c5 | Super_Bowl_50 | 0.915792 |
4 | 328 | Who did the Broncos beat in the divisional game? | {'answer_start': [25, 21, 36], 'text': ['Pitts... | The Broncos defeated the Pittsburgh Steelers i... | 2020-10-02T07:47:11.956089 | 56d7018a0d65d214001982c2 | Super_Bowl_50 | 0.906187 |
[23]:
#6 hybrid search combining traditional and nlp vector search
search_results = vi_client.hybrid_search(collection_name, 'Peyton Men',
text_encoder.encode('Peyton Men'),
['question_vector_'], ['question'],
traditional_weight=0.015,
page_size=5)
vi_client.results_to_df(search_results)
[23]:
_id | question | answers | context | insert_date_ | id | title | _search_score | |
---|---|---|---|---|---|---|---|---|
0 | 258 | How old was Peyton Manning in 2015? | {'answer_start': [817, 817, 817], 'text': ['39... | Following their loss in the divisional round o... | 2020-10-02T07:47:11.000830 | 56bf301c3aeaaa14008c9550 | Super_Bowl_50 | 0.641220 |
1 | 276 | How may yards did Peyton Manning throw? | {'answer_start': [77, 77, 77], 'text': ['2,249... | Manning finished the year with a career-low 67... | 2020-10-02T07:47:11.239195 | 56bf38383aeaaa14008c956c | Super_Bowl_50 | 0.634783 |
2 | 270 | What was Peyton Manning's passer rating for th... | {'answer_start': [44, 44, 44], 'text': ['67.9'... | Manning finished the year with a career-low 67... | 2020-10-02T07:47:11.238646 | 56beb57b3aeaaa14008c9279 | Super_Bowl_50 | 0.617874 |
3 | 252 | Who did Peyton Manning play for as a rookie? | {'answer_start': [641, 637, 654], 'text': ['In... | Following their loss in the divisional round o... | 2020-10-02T07:47:10.760423 | 56beb4e43aeaaa14008c9267 | Super_Bowl_50 | 0.612926 |
4 | 356 | Peyton Manning took how many different teams t... | {'answer_start': [57, 57, 57, 57], 'text': ['t... | Peyton Manning became the first quarterback ev... | 2020-10-02T07:47:12.428915 | 56d704430d65d214001982de | Super_Bowl_50 | 0.611716 |
[ ]: